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Abstract 

We design a new, fast algorithm for agnostically learning univariate probability distributions 
whose densities are well approximated by piecewise polynomial functions. Let / be the density 
function of an arbitrary univariate distribution, and suppose that / is OPT close in Li-distance 
to an unknown piecewise polynomial function with t interval pieces and degree d. Our algorithm 
draws n = 0(t(d + l)/e 2 ) samples from /, runs in time 0(n ■ poly(d)), and with probability at 
least 9/10 outputs an 0(f)-piecewise degree-d hypothesis h that is 4 • OPT + e close to /. 

Our general algorithm yields (nearly) sample-optimal and nearly-linear time estimators for 
a wide range of structured distribution families over both continuous and discrete domains in 
a unified way. For most of our applications, these are the first sample-optimal and nearly- 
linear time estimators in the literature. As a consequence, our work resolves the sample and 
computational complexities of a broad class of inference tasks via a single “meta-algorithm”. 
Moreover, we experimentally demonstrate that our algorithm performs very well in practice. 

Our algorithm consists of three “levels”: (i) At the top level, we employ an iterative greedy 
algorithm for finding a good partition of the real line into the pieces of a piecewise polynomial, 
(ii) For each piece, we show that the sub-problem of finding a good polynomial fit on the current 
interval can be solved efficiently with a separation oracle method, (iii) We reduce the task of 
finding a separating hyperplane to a combinatorial problem and give an efficient algorithm for 
this problem. Combining these three procedures gives a density estimation algorithm with the 
claimed guarantees. 
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1 Introduction 


Estimating an unknown probability density function based on observed data is a classical problem 
in statistics that has been studied since the late nineteenth century, starting with the pioneering 
work of Karl Pearson [Pea95]. Distribution estimation has become a paradigmatic and fundamental 
unsupervised learning problem with a rich history and extensive literature (see e.g., [BBBB72, 
DG85, Sil86, Sco92, DL01]). A number of general methods for estimating distributions have been 
proposed in the mathematical statistics literature, including histograms, kernels, nearest neighbor 
estimators, orthogonal series estimators, maximum likelihood, and more. We refer the reader to 
[Ize91] for a survey of these techniques. During the past few decades, there has been a large body of 
work on this topic in computer science with a focus on computational efficiency [KMR+94, FM99, 
FOS05, BS10, KMV10, MV10, KSV08, VW02, DDS12a, DDS12b, DDO+13, CDSS14a], 

Suppose that we are given a number of samples from an unknown distribution that belongs to 
(or is well-approximated by) a given family of distributions C. e.g., it is a mixture of a small number 
of Gaussians. Our goal is to estimate the unknown distribution in a precise, well-defined way. In 
this work, we focus on the problem of density estimation (non-proper learning), where the goal is to 
output an approximation of the unknown density without any constraints on its representation. That 
is, the output hypothesis is not necessarily a member of the family C. The “gold standard” in this 
setting is to design learning algorithms that are both statistically and computationally efficient. More 
specifically, the ultimate goal is to obtain estimators whose sample size is information-theoretically 
optimal, and whose running time is (nearly) linear in their sample size. An important additional 
requirement is that our learning algorithms are agnostic or robust under model misspecihcation, 
i.e., they succeed even if the target distribution does not belong to the given family C but is merely 
well-approximated by a distribution in C. 

We study the problem of density estimation for univariate distributions, i.e., distributions with a 
density / : 14 —> M + , where the sample space Q is a subset of the real line. While density estimation 
for families of univariate distributions has been studied for several decades, both the sample and 
time complexity were not yet well understood before this work, even for surprisingly simple classes of 
distributions, such as mixtures of Binomials and mixtures of Gaussians. Our main result is a general 
learning algorithm that can be used to estimate a wide variety of structured distribution families. 
For each such family, our general algorithm simultaneously satisfies all three of the aforementioned 
criteria, i.e., it is agnostic, (nearly) sample optimal, and runs in nearly-linear time. 

Our algorithm is based on learning a piecewise polynomial function that approximates the 
unknown density. The approach of using piecewise polynomial approximation has been employed 
in this context before — our main contribution is to improve the computational complexity of this 
method and to make it nearly-optimal for a wide range of distribution families. The key idea of using 
piecewise polynomials for learning is that the existence of good piecewise polynomial approximations 
for a family C of distributions can be leveraged for the design of efficient learning algorithms for the 
family C. The main algorithmic ingredient that makes this method possible is an efficient procedure 
for agnostically learning piecewise polynomial density functions. In prior work, Chan, Diakonikolas, 
Servedio, and Sun [CDSS14a] obtained a nearly-sample optimal and polynomial time algorithm for 
this learning problem. Unfortunately, however, the polynomial exponent in their running time is 
quite high, which makes their algorithm prohibitively slow for most applications. 

In this paper, we design a new, fast algorithm for agnostically learning piecewise polynomial 
distributions, which in turn yields sample-optimal and nearly-linear time estimators for a wide range 
of structured distribution families over both continuous and discrete domains. For most of our 
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applications, these are the first sample-optimal and nearly-linear time estimators in the literature. 
As a consequence, our work resolves the sample and computational complexity of a broad class of 
inference tasks via a single “meta-algorithm”. Moreover, we experimentally demonstrate that our 
algorithm performs very well in practice. We stress that a significant number of new algorithmic 
and technical ideas are needed for our main result, as we explain next. 

1.1 Our results and techniques 

In this section, we describe our results in detail, compare them to prior work, and give an overview 
of our new algorithmic ideas. 

Preliminaries. We consider univariate probability density functions (pdf’s) defined over a known 
finite interval I CM. (We remark that this assumption is without loss of generality and our results 
easily apply to densities defined over the entire real line.) 

We focus on a standard notion of learning an unknown probability distribution from sam¬ 
ples [KMR+94], which is a natural analogue of Valiant’s well-known PAC model for learning Boolean 
functions [Val84] to the unsupervised setting of learning an unknown probability distribution. (We 
remark that our definition is essentially equivalent to the notion of the L\ minimax rate of conver¬ 
gence in statistics [DL01].) A distribution learning problem is defined by a class C of probability 
distributions over a domain 0. Given e > 0 and sample access to an unknown distribution with den¬ 
sity /, the goal of an agnostic learning algorithm for C is to compute a hypothesis h such that, with 
probability at least 9/10, it holds \\h — f\\i < C ■ OPT<j(/) + e, where OPTc(/) := inf ge c \\q — f ||i, 
i.e., OPTc(/) is the Li-distance between the unknown density / and the closest distribution to it 
in C, and C > 1 is a universal constant. 

We say that a function / over an interval I is a t-piecewise degree-d polynomial if there is a 
partition of / into t disjoint intervals I\,.... If such that f(x ) = fj(x) for all x £ Ij, where each of 
/i,..., fi is a polynomial of degree at most d. Let Vfd{I) denote the class of all t-piecewise degree-d 
polynomials over the interval I. 

Our Results. Our main algorithmic result is the following: 

Theorem 1 (Main). Let f : I R+ be the density of an unknown distribution over I, where I 
is either an interval or the discrete set [IV]. There is an algorithm with the following performance 
guarantee: Given parameters t,d G Z + , an error tolerance e > 0, and any 7 > 0, the algorithm draws 
n = 0 7 (t(d + l)/e 2 ) samples from the unknown distribution, runs in time 0(n ■ poly(d + 1 )), and 
with probability at least 9/10 outputs an 0(t)-piecewise degree-d hypothesis h such that \\f — h ||i < 
(3 + 7 )OPT tj d(/) + e, where OPT t : d(f) ■= inf r ^-p t d (i) 11/ — r lli ts the error of the best t-piecewise 
degree-d approximation to f. 

In prior work, [CDSS14a] gave a learning algorithm for this problem that uses 0(t(d + l)/e 2 ) 
samples and runs in poly (t,d + 1,1/e) time. We stress that the algorithm of [CDSS14a] is pro¬ 
hibitively slow. In particular, the running time of their approach is ft(f 3 • (d 3 ' 5 /e 3 " 5 + d 6 ' 5 /e 2 ' 5 )), 
which renders their result more of a “proof of principle” than a computationally efficient algorithm. 

This prompts the following question: Is such a high running time necessary to achieve this 
level of sample efficiency? Ideally, one would like a sample-optimal algorithm with a low-order 
polynomial running time (ideally, linear). 

Our main result shows that this is indeed possible in a very strong sense. The running time of our 
algorithm is linear in t/e 2 (up to a log(l/e) factor), which is essentially the best possible; the polyno¬ 
mial dependence on d is 0(d 3+UJ ), where 1 v is the matrix multiplication exponent. This substantially 
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improved running time is of critical importance for the applications of Theorem 1. Moreover, the 
sample complexity of our algorithm removes the extraneous logarithmic factors present in the sam¬ 
ple complexity of [CDSS14a] and matches the information-theoretic lower bound up to a constant 
factor. As we explain below. Theorem 1 leads to (nearly) sample-optimal and nearly-linear time 
estimators for a wide range of natural and well-studied families. For most of these applications, 
ours is the first estimator with simultaneously nearly optimal sample and time complexity. 

Our new algorithm is clean and modular. As a result, Theorem 1 also applies to discrete 
distributions over an ordered domain (e.g., [ N ]). The approach of [CDSS14a] does not extend to 
polynomial approximation over discrete domains, and designing such an algorithm was left as an 
open problem in their work. As a consequence, we obtain several new applications to learning 
mixtures of discrete distributions. In particular, we obtain the first nearly sample optimal and 
nearly-linear time estimators for mixtures of Binomial and Poisson distributions. To the best of 
our knowledge, no polynomial time algorithm with nearly optimal sample complexity was known 
for these basic learning problems prior to this work. 

Applications. We now explain how to use Theorem 1 in order to agnostically learn structured 
distribution families. Given a class C that we want to learn, we proceed as follows: (i) Prove 
that any distribution in C is e/2-close in Li-distance to a f-piecewise degree-d polynomial, for 
appropriate values of t and d. (ii) Use Theorem 1 for these values of t and d to agnostically learn 
the target distribution up to error e/2. Note that t and d will depend on the desired error e and the 
underlying class C. We emphasize that there are many combinations of t and d that guarantee an 
e/2-approximation of C in Step (i). To minimize the sample complexity of our learning algorithm 
in Step (ii), we would like to determine the values of t and d that minimize the product t(d + 1). 
This is, of course, an approximation theory problem that depends on the structure of the family C. 

For example, if C is the family of log-concave distributions, the optimal f-histogram approxima¬ 
tion with accuracy e requires 0(l/e) intervals. This leads to an algorithm with sample complexity 
0(l/e 3 ). On the other hand, it can be shown that any log-concave distribution has a piecewise linear 
e-approximation with ©(l/e 1 / 2 ) intervals [CDSS14a, DK15], which yields an algorithm with sample 
complexity 0(l/e 5 / 2 ). Perhaps surprisingly, this sample bound cannot be improved using higher 
degree piecewise polynomials; one can show an information-theoretic lower bound of U(l/e 5//2 ) for 
learning log-concave densities [DL01]. Hence, Theorem 1 gives a sample-optimal and nearly-linear 
time agnostic learning algorithm for this fundamental problem. We remark that piecewise polyno¬ 
mial approximations are “closed” under taking mixtures. As a corollary, Theorem 1 also yields an 
0(k/e 5 / 2 ) sample and nearly-linear time algorithm for learning an arbitrary mixture of k log-concave 
distributions. Again, there exists a matching information-theoretic lower bound of Q(k/e 5 ^ 2 ). 

As a second example, let C be the class of mixtures of k Gaussians in one dimension. It 
is not difficult to show that learning such a mixture of Gaussians up to Li-distance e requires 
H(fc/e 2 ) samples. By approximating the corresponding probability density functions with piecewise 
polynomials of degree 0(log(l/e)), we obtain an agnostic learning algorithm for this class that uses 
n = 0(k/e 2 ) samples and runs in time 0{n). Similar bounds can be obtained for several other 
natural parametric mixture families. 

Note that for a wide range of structured families, 1 the optimal choice of the degree d (i.e., the 
choice minimizing t{d+ 1) among all e/2-approximations) will be at most poly-logarithmic in 1/e. For 

1 This includes all structured families considered in [CDSS14a] and several previously-studied distributions not 
covered by [CDSS14a]. 
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several classes (such as unimodal, monotone hazard rate, and log-concave distributions), the degree 
d is even a constant. As a consequence, Theorem 1 yields (nearly) sample optimal and nearly-linear 
time estimators for all these families in a unified way. In particular, we obtain sample optimal (or 
nearly sample optimal) and nearly-linear time estimators for a wide range of structured distribution 
families, including arbitrary mixtures of natural distributions such as multi-modal, concave, convex, 
log-concave, monotone hazard rate, Gaussian, Poisson, Binomial, functions in Besov spaces, and 
others. 

See Table 1 for a summary of these applications. For each distribution family in the table, we 
provide a comparison to the best previous result. Note that we do not aim to exhaustively cover all 
possible applications of Theorem 1, but rather to give some selected applications that are indicative 
of the generality and power of our method. 

Moreover, our non-proper learning algorithm is also useful for proper learning. Indeed, Theo¬ 
rem 1 has recently been used [LS15] as a crucial component to obtain the fastest known agnostic 
algorithm for properly learning a mixture of univariate Gaussian distributions. Note that non-proper 
learning and proper learning for a family C are equivalent in terms of sample complexity: given any 
(non-proper) hypothesis, we can perform a brute-force search to find its closest approximation in 
the class C. The challenging part is to perform this computation efficiently. Roughly speaking, given 
a piecewise polynomial hypothesis, [LS15] design an efficient algorithm to find the closest mixture 
of k Gaussians. 

Our Techniques. We now provide a brief overview of our new algorithm and techniques in parallel 
with a comparison to the previous algorithm of [CDSS14a], We require the following definition. 
For any k > 1 and an interval I CM, define the .A/..-norm of a function g : I —> M to be 

k 

\\g\\A k = f sup ^2\g{Ii )|, 

' .d. S' 

where the supremum is over all sets of k disjoint intervals 7) ,..., //,. in /, and g(J) = Jj g(x) dx for 
any measurable set J C I. Our main probabilistic tool is the following well-known version of the 
VC inequality: 

Theorem 2 (VC Inequality [VC71, DL01]). Let f : I —>• M + be an arbitrary pdf over I, and let f 
be the empirical pdf obtained after taking n i.i.d. samples from f. Then 

Kill/ - /U] < O (Jfj. 

Given this theorem, it is not difficult to show that the following two-step procedure is an agnostic 
learning algorithm for Vt.d'- 

(1) Draw a set of n = Q(t(d + 1)/e 2 ) samples from /; 

(2) Output the piecewise-polynomial hypothesis h € Vt 4 that minimizes the quantity ||/i — /||_ 4 fe 
up to an additive error of O(e), where k = 0(t(d + 1)). 

We remark that the optimization problem in Step (2) is non-convex. However, it has sufficient 
structure so that it can be solved in polynomial time. Intuitively, an algorithm for Step (2) involves 
two main ingredients: 
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Class of distributions 

Sample 

complexity 

Time 

complexity 

Reference 

Optimality 

t-histograms 

o(£) 

o(£) 

[CDSS14b] 



o(|) 

0(£ log(l/e)) 

Theorem 10 

SO, TOS 

f-piecewise 

O(^) 

5 («•■(£ + £)) 

[CDSS14a] 


degree-d polynomials 

O(^) 

o(‘ d X‘) 

Theorem 1 

MSO 

/c-mixture of log-concave 

O(^) 

O(S) 

[CDSS14a] 



O(^) 

o(M 

Theorem 42 

SO, MTO 

/c-mixture of Gaussians 

o(£) 

o(Ji) 

[CDSS14a] 



Q (fclog(l/e)) 

0(i.,) 

Theorem 43 

MSO, MTO 

Besov space (L p ([0,1])) 

n (i°g 2 (i/d>\ 
Ua ( e 2+l/c J 

0 ( 1 ) 

^ot l e 6+3/a J 

[WN07] 



Oa ( e 2+l/a) 

Oa ( f 2+l/a) 

Theorem 44 

SO, MTO 

/c-mixture of t-monotone 

O(J^) 

0(£, • («S + «S)) 

[CDSS14a] 



O(J^) 


Theorem 45 

SO, MTO 
for t = 1,2 

/c-mixture of f-modal 

Q(t-fclog(JV)) 

Q(frfclog(iV)^ 

[CDSS14b] 



Q^t-klog(N)s 

o(“' i :i« n » iogo/0) 

Theorem 46 

SO, TOS 

/c-mixture of MHR 

d{ klos i N/e) )) 

0( klog(N/e ))) 

[CDSS14b] 



0 (fcl0g(iV/6)) 

0( fciogj«z5 i og (i /E)) 

Theorem 47 

SO, TOS 

/c-mixture of 
Binomial, Poisson 

d(& 

6(£) 

[CDSS14b] 



0(^2gdM) 

6(£) 

Theorem 48 

MSO, MTO 


SO : Sample complexity is optimal up to a constant factor. 

A fSO : Sample complexity is optimal up to a poly-logarithmic factor. 

TOS : Time complexity is optimal (up to sorting the samples). 

MTO : Time complexity is optimal up to a poly-logarithmic factor. 

Table 1: A list of applications to agnostically learning specific families of distributions. For each 
class, the first row is the best known previous result and the second row is our result. Note that for 
most of the examples, our algorithm runs in time that is nearly-linear in the information-theoretically 
optimal sample complexity. The last three classes are over discrete sets, and N denotes the size of 
the support. 


(2.1) An efficient procedure to find a good set t intervals. 

(2.2) An efficient procedure to agnostically learn a degree-d polynomial in a given sub-interval of 
the domain. 

We remark that the procedure for (2.1) will use the procedure for (2.2) multiple times as a subroutine. 
[CDSS14a] solve an appropriately relaxed version of Step (2) by a combination of linear pro- 


5 




gramming and dynamic programming. Roughly speaking, they formulate a polynomial size linear 
program to agnostically learn a degree-ci polynomial in a given interval, and use a dynamic pro¬ 
gram in order to discover the correct t intervals. It should be emphasized that the algorithm 
of [CDSS14a] is theoretically efficient (polynomial time), but prohibitively slow for real applica¬ 
tions with large data sets. In particular, the linear program of [CDSS14a] has Q(d/e) variables and 
fl(d 2 /e 2 + d 5 /e) constraints. Hence, the running time of their algorithm using the fastest known 
LP solver for their instance [LS14] is at least fl(d 3 ' 5 /e 3,5 + d 6 " 5 /e 2 ' 5 ). Moreover, the dynamic pro¬ 
gram to implement (2) has running time at least H(t 3 ). This leads to an overall running time of 
H (t 3 • (d 3 ' 5 /e 3 " 5 + d 6 ’ 5 /e 2 ' 5 )), which quickly becomes unrealistic even for modest values of e,t, and 
d. 

We now provide a sketch of our new algorithm. At a high-level, we implement procedure (2.1) 
above using an iterative greedy algorithm. Our algorithm circumvents the need for a dynamic 
programming approach as follows: The main idea is to iteratively merge pairs of intervals by calling 
an oracle for procedure (2.2) in every step until the number of intervals becomes 0(t). Our iterative 
algorithm and its subtle analysis are directly inspired by the VC inequality. Roughly speaking, in 
each iteration the algorithm estimates the contribution to an appropriate notion of error when two 
consecutive intervals are merged, and it merges pairs of intervals with small error. This procedure 
ensures that the number of intervals in our partition decreases geometrically with the number of 
iterations. 

Our algorithm for procedure (2.2) is based on convex programming and runs in time poly(ci + 
l)/e 2 (note that the dependence on e is optimal). To achieve this significant running time improve¬ 
ment, we use a novel approach that is quite different from that of [CDSS14a], Roughly speaking, 
we are able to exploit the problem structure inherent in the Ak optimization problem in order to 
separate the problem dimension d from the problem dimension 1/e, and only solve a convex program 
of dimension d (as opposed to dimension poly(d/e) in [CDSS14a]). More specifically, we consider 
the convex set of non-negative polynomials with Ad+i distance at most r from the empirical distri¬ 
bution. While this set is defined through a large number of constraints, we show that it is possible 
to design a separation oracle that runs in time nearly linear in both the number of samples and the 
degree d. Combined with tools from convex optimization such as the Ellipsoid method or Vaidya’s 
algorithm, this gives an efficient algorithm for procedure (2.2). 

1.2 Related work 

There is a long history of research in statistics on estimating structured families of distributions. 
For distributions over continuous domains, a very natural type of structure to consider is some sort 
of “shape constraint” on the probability density function (pdf) defining the distribution. Statistical 
research in this area started in the 1950’s, and the reader is referred to the book [BBBB72] for a sum¬ 
mary of the early work. Most of the literature in shape-constrained density estimation has focused 
on one-dimensional distributions, with a few exceptions during the past decade. Various structural 
restrictions have been studied over the years, starting from monotonicity, unimodality, convexity, 
and concavity [Gre56, Bru58, Rao69, Weg70, HP76, Gro85, Bir87a, Bir87b, Fou97, CT04, JW09], 
and more recently focusing on structural restrictions such as log-concavity and A:-monotonicity 
[BW07, DR09, BRW09, GW09, BW10, KM10, Wal09, DW13, CS13, KS14, BD14, HW15], The 
reader is referred to [GJ14] for a recent book on the subject. Mixtures of structured distributions 
have received much attention in statistics [Lin95, RW84, TSM85, LB99] and, more recently, in 
theoretical computer science [Das99, DSOO, AK01, VW02, FOS05, AM05, MVlOj. 
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The most common method used in statistics to address shape-constrained inference problems is 
the Maximum Likelihood Estimator (MLE) and its variants. While the MLE is very popular and 
quite natural, we note that it is not agnostic, and it may in general require solving an intractable 
optimization problem (e.g., for the case of mixture models.) 

Piecewise polynomials (splines) have been extensively used as tools for inference tasks, including 
density estimation, see, e.g., [WW83, WN07, Sto94, SHKT97]. We remark that splines in the 
statistics literature have been used in the context of the MLE, which is very different than our 
approach. Moreover, the degree of the splines used is typically bounded by a small constant and 
the underlying algorithms are heuristic in most cases. A related line of work in mathematical 
statistics [KP92, DJKP95, KPT96, DJKP96, DJ98] uses non-linear estimators based on wavelet 
techniques to learn continuous distributions whose densities satisfy various smoothness constraints, 
such as Triebel and Besov-type smoothness. We remark that the focus of these works is usually on 
the statistical efficiency of the proposed estimators. 

For the problem of learning piecewise constant distributions with t unknown interval pieces, 
[CDSS14b] recently gave an n = 0(t/e 2 ) sample and 0(n) time algorithm. However, their approach 
does not seem to generalize to higher degrees. Moreover, recall that Theorem 1 removes all loga¬ 
rithmic factors from the sample complexity. Furthermore, our algorithm runs in time proportional 
to the time required to sort the samples, while their algorithm has additional logarithmic factors in 
the running time (see Table 1). 

Our iterative merging idea is quite robust: together with Hegde, the authors of the current 
paper have shown that an analogous approach yields sample optimal and efficient algorithms for 
agnostically learning discrete distributions with piecewise constant functions under the ^-distance 
metric [ADH+15]. We emphasize that learning under the .^-distance is easier than under the L\- 
distance, and that the analysis of [ADH + 15] is significantly simpler than the analysis in the current 
paper. Moreover, the algorithmic subroutine of finding a good polynomial fit on a fixed interval 
required by the ^2 algorithm is substantially simpler than the subroutine we require here. Indeed, in 
our case, the associated optimization problem has exponentially many linear constraints, and thus 
cannot be fully described, even in polynomial time. 

Paper Structure. After some preliminaries in Section 2, we give an outline of our algorithm 
in Section 3. Sections 4-6 contain the various components of our algorithm. Section 7 gives a 
detailed description of our applications to learning structured distribution families, and we conclude 
in Section 8 with our experimental evaluation. 

2 Preliminaries 

We consider univariate probability density functions (pdf’s) defined over a known finite interval 
/ Cl, For an interval JCI and a positive integer k, we will denote by 3 k j the family of all sets of k 
disjoint intervals 7),..., A, where each C J. For a measurable function g : I —> M and a measurable 
set S, let g(S) == f s g. The Li-norm of g over a subinterval J C I is defined as ||5 r ||i,j fj \g(x)\dx. 
More generally, for any set of disjoint intervals J E 3^, we define ||(/||i,j. 

We now define a norm which induces a corresponding distance metric that will be crucial for 
this paper: 

Definition 3 (M^-norm). Let k be a positive integer and let g : I — >• M be measurable. For any 
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subinterval J C I, the ^.-norni of g on J is defined as 

llffll Ak,J d = sup \9( M )\ ■ 

Mel 

When J = I, we omit the second subscript and simply write ||^||^ fc . 

More generally, for any set of disjoint intervals J = { J \,..., J?} where each J\ C I, we define 

llsll A k ,J = supVb(J)| 

T — 

Jei 

where the supremum is taken over all X 6 3 j such that for all J 6 X there is a Ji £ J with J C Jj. 

We note that the definition of the ^-riorrn in this work is slightly different than that in [DL01, 
CDSS14a] but is easily seen to be essentially equivalent. The VC inequality (Theorem 2) along with 
uniform convergence bounds (see, e.g., Theorem 2.2. in [CDSS13] or p. 17 in [DL01]), yields the 
following: 

Corollary 4. Fix 0 < e and 5 < 1. Let f : I M_|_ be an arbitrary pdf over I, and let f be 
the empirical pdf obtained after taking n = Q((k + logl/<5)/e 2 ) i.i.d. samples from f. Then with 
probability at least 1 — 5, 

11/ - f\u k < e • 

Definition 5. Let g : I —>• M. We say that g has at most k sign changes if there exists a partition 
of I into intervals 1 1 ,..., Ik+i such that for all i £ [k + 1] either g{x) > 0 for all x € fi or g(x) < 0 
for all x € /*. 

We will need the following elementary facts about the .Afc-norm. 

Fact 6. Let J C I be an arbitrary interval or a finite set of intervals. Let g : I —> M be a measurable 
function. 

(a) If g has at most k — 1 sign changes in J, then ||^||-^ j = ||<?|| ^ j . 

(b) For all k >1, we have ||ff||_ 4 fc j < ||5'|| 1 j ■ 

(c) Let a be a positive integer. Then, HgH^ 7 < a ■ ||ff||_ 4 fc j ■ 

(d) Let f : I M+ be a pdf over I, and let J\,... ,Jn be finite sets of disjoint subintervals of 

I, such that for all i,i' and for all I € Ji and I' £ Jy, I and I' are disjoint. Then, for all 
positive integers mu.. .,m e , Ei=i ll/IU^,^ < ll/IU M , where M = Ya =i m i- 


3 Paper outline 

In this section, we give a high-level description of our algorithm for learning f-piecewise degree-d 
polyonomials. Our algorithm can be divided into three layers. 



Level 1: General merging (Section 4). At the top level, we design an iterative merging 
algorithm for finding the closest piecewise polynomial approximation to the unknown target density. 
Our merging algorithm applies more generally to broad classes of piecewise hypotheses. Let T> be 
a class of hypotheses satisfying the following: (i) The number of intersections between any two 
hypotheses in V is bounded, (ii) Given an interval J and an empirical distribution /, we can 
efficiently find the best fit to / from functions in V with respect to the ^-distance, (iii) We can 
efficiently compute the ^-distance between the empirical distribution and any hypothesis in V. 
Under these assumptions, our merging algorithm agnostically learns piecewise hypotheses where 
each piece is in the class V. 

In Section 4.1, we start by presenting our merging algorithm for the case of piecewise constant 
hypotheses. This interesting special case captures many of the ideas of the general case. In Section 
4.2, we proceed to present our general merging algorithm that applies all classes of distributions 
satisfying properties (i)-(iii). 

When we adapt the general merging algorithm to a new class of piecewise hypotheses, the main 
algorithmic challenge is constructing a procedure for property (ii). More formally, we require a 
procedure with the following guarantee. 

Definition 7. Fix r/ > 0. An algorithm O p (f , J,r/) is an ^-approximate M^-projection oracle for 
V if it takes as input an interval J and f, and returns a hypothesis h E V such that 

II h - f\\ Ak < inf \\ti - f\\A k ,J + V ■ 
h'£V 

One of our main contributions is an efficient Mfc-projection oracle for the class of degree-d 
polynomials, which we describe next. 

Level 2: Mfc-projection for polynomials (Section 5). Our Mfc-projection oracle computes the 
coefficients c E M rf+1 of a degree-d polynomial p c that approximately minimizes the Mfc-distance 
to the empirical distribution / in the given interval J. Moreover, our oracle ensures that p c is 
non-negative on J. 

At a high-level, we formulate the Mfc-projection as a convex optimization problem. A key insight 
is that we can construct an efficient, approximate separation oracle for the set of polynomials that 
have an M^-distance of at most r to the empirical distribution /. Combining this separation oracle 
with existing convex optimization algorithms allows us to solve the feasibility problem of checking 
whether we can achieve a given Mfc-distance r. We then convert the feasibility problem to the 
optimization variant via a binary search over r. 

Note that the set of non-negative polynomials is a spectrahedron (the feasible set of a semidehnite 
program). After restricting the set of coefficients to non-negative polynomials, we can simplify the 
definition of the Mfc-distance: it suffices to consider sets of intervals with endpoints at the locations 
of samples. Hence, we can replace the supremum in the definition of the Mfc-distance by a maximum 
over a finite set, which shows that the set of polynomials that are both non-negative and r-close 
to / in Mfc-distance is also a spectrahedron. This suggests that the M^-projection problem could 
be solved by a black-box application of an SDP solver. However, this would lead to a running time 
that is exponential in k because there are more than ( 2 s fc ) possible sets of intervals, where s is the 
number of sample points in the current interval J. 2 

2 While the authors of [CDSS14a] introduce an encoding of the Afc-constraint with fewer linear inequalities, their 
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Instead of using black-box SDP or LP solvers, we construct an algorithm that exploits additional 
structure in the ^-projection problem. Most importantly, our algorithm separates the dimension 
of the desired degree-d polynomial from the number of samples (or equivalently, the error parameter 
e). This allows us to achieve a running time that is nearly-linear for a wide range of distributions. 
Interestingly, we can solve our SDP significantly faster than the LP which has been proposed in 
[CDSS14a] for the same problem. We achieve this by combining Vaidya’s cutting plane method 
[Vai96] with an efficient separation oracle that leverages the structure of the Mfc-distance. This 
separation oracle is the third level of our algorithm, which we describe next. 

Level 3: M^-separation oracle for polynomials (Section 6). Our separation oracle efficiently 
tests two properties for a given polynomial p c with coefficients c £ R rf+1 : (i) Is the polynomial 
p c non-negative on the given interval J? (ii) Is the M^-distance between p c and the empirical 
distribution / at most r? We implement Test (i) by using known algorithms for finding roots of 
real polynomials efficiently [PanOl]. Note, however, that root-finding algorithms cannot be exact 
for degrees larger than 4. Hence, we can only approximately Test (i), which necessarily leads to 
an approximate separation oracle. Nevertheless, we show that such an approximate oracle is still 
sufficient for solving the convex program outlined above. 

At a high level, our algorithm proceeds as follows. We first verify that our current candidate 
polynomial p c is “nearly” non-negative at every point in J. Assuming that p c passes this test, 
we then focus on the problem of computing the ^-distance between p c and /. We reduce this 
problem to a discrete variant by showing that the endpoints of intervals jointly maximizing the 
Mfc-distance are guaranteed to coincide with sample points of the empirical distribution (assuming 
p c is nearly non-negative on the current interval). Our discrete variant of this problem is related 
to a previously studied question in computational biology, namely finding maximum-scoring DNA 
segment sets [Csu04]. We exploit this connection and give a combinatorial algorithm for this discrete 
variant that runs in time nearly-linear in the number of sample points in J and the degree d. Once 
we have found a set of intervals maximizing the M^-distance, we can convert it to a separating 
hyperplane for the polynomial coefficients c and the set of non-negative polynomials with Ak~ 
distance at most r to /. 

Combining these ingredients yields our general algorithm with the performance guarantees stated 
in Theorem 1. 

4 Iterative merging algorithm 

In this section, we describe and analyze our iterative merging algorithm. We start with the case of 
histograms and then provide the generalization to piecewise polynomials. 

4.1 The histogram merging algorithm 

A t-histogram is a function h : I —>• R that is piecewise constant with at most t interval pieces, 
i.e., there is a partition of I into intervals Ii,... ,If with t' < t such that h is constant on each 
Jj. Given sample access to an arbitrary pdf / over I and a positive integer t, we would like to 

approach increases the number of variables in the optimization problem to depend polynomially on 1/e, which leads 
to an S!(poly(d + l)/e 3 ' 5 ) running time. In contrast, our approach achieves a nearly optimal dependence on e that is 
0(poly(d+ l)/e 2 ). 
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efficiently compute a good i-histogram approximation to /. Namely, if hit = 'titij') denotes the set 
of t-histogram probability density functions over I and OPT* = inf ge % t || g — f ||i, our goal is to 
output an 0(f)-histogram h : I —> M that satisfies \\h — f\\i < C • OPTt + 0(e) with high probability 
over the samples, where C is a universal constant. 

The following notion of flattening a function over an interval will be crucial for our algorithm: 

Definition 8. For a function g : I —> M and an interval J = [ u , u] C I, we define the flattening of 
g over J, denoted gj, to he the constant function defined on J as 

7jj(x) —— for all x € J. 

v — u 

For a set T of disjoint intervals in I, we define the flattening of g over X to he the function g x on 
UjgjJ which for each J G X satisfies gx( x ) = 9j( x ) f or a M x £ J- 

We start by providing an intuitive explanation of our algorithm followed by a proof of correctness. 
The algorithm draws n = 0((t + logl/<5)/e 2 ) samples x\ < X 2 < ... < x n from /. We start with 
the following partition of I = [a, b\: 

X 0 = {[a,xi), [xi,xi], {xi,x 2 ), [x 2 ,x 2 ], • • •, {x n -i,x n ), [x n ,x n ], (x n ,b]}. (1) 

This is the partition where each interval is either a single sample point or the interval between two 
consecutive samples. Starting from this partition, our algorithm greedily merges pairs of consecutive 
intervals in a sequence of iterations. When deciding which interval pairs to merge, the following 
notion of approximation error will be crucial: 

Definition 9. For a function g : I —>• R and an interval J C J ; define e(g, J) = \\g — X/H^ j ■ We 
call this quantity the yli-error of g on J. 

In the j-th iteration, given the current interval partition Xj, we greedily merge pairs of consecu¬ 
tive intervals to form the new partition Ij+i- Let Sj be the number of intervals in X ; . In particular, 
given Xj = {Iij, ..., I S] j}, we consider the intervals 

I'ej+i = hi-i,j u he,j 

for all 1 < t < Sj/ 2. 3 We first iterate through 1 < i < Sj/2 and calculate the quantities 

e £,j = e (/j Igj+i) j 

i.e., the yli-errors of the empirical distribution on the candidate intervals. 

To construct Zj+ 1> the algorithm keeps track of the largest Oft) errors egj. For each t with egj 
being one of the 0(t) largest errors, we do not merge the corresponding intervals I 2 g-ij and I 2 £j. 
That is, we include I 2 g-i,j and I 2 £j in the new partition X J+ i. Otherwise, we include their union 
Igj+i i n -Ej+ 1- We perform this procedure 0(log times and arrive at some final partition X. Our 
output hypothesis is the flattening of / with respect to X. 

For a formal description of our algorithm, see the pseudocode given in Algorithm 1 below. 
In addition to the parameter t, the algorithm has a parameter a > 1 that controls the trade-off 
between the approximation ratio C achieved by the algorithm and the number of pieces in the 
output histogram. 

The following theorem characterizes the performance of Algorithm 1, establishing the special 
case of Theorem 1 corresponding to d = 0. 

3 We assume Sj is even for simplicity. 
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Algorithm 1 Approximating with histograms by merging. 

1: function ConstructHistogram(/, t, a , e, d) 

2: Draw n = Q((at + log l/e>)/e 2 ) samples x\ < x 2 < ... < x n . 

3: Form the empirical distribution / from these samples. 

4: Let {[a, xi), [x\, xi], (x\, x 2 ), • • •, (x n -i,x n ), [x n , x n ], (x n , 6]} be the initial partition. 

5: j <— 0 

6: while \Tj\ > 2 a ■ t do 

X Let Xj — {l 1 .j - I 2 j , • • •, Isj—\,ji h, ,j } 

8: for £ G {1, 2,..., ^-} do 

9: •«“ J 2£-i,j U I 2 ej 

10: e £,j 

11: end for 

12: Let L be the set of l G {1, 2,..., with the at largest errors egj. 

13: Let M be the complement of L. 

14: X J+ i ^ U {-^2€— 1 ,j > ^ 2 £,j } 

t&L 

15: 2-j+i •(— U {/^j_ ) _ 1 | £ € M} 

16: j < j ■ 1 

17: end while _ 

18: return X = Xj and the flattening fx 

19: end function 

Theorem 10. Algorithm ConstructHistogram(/, t, a, e, 5) draws n = 0((af+log l/<5)/e 2 ) sam¬ 
ples from f, runs in time 0 (n( log(l/e) + log log(l/5))), and outputs a hypothesis h and a corre¬ 
sponding partition X of size |X| <2 a ■ t such that with probability at least 1 — 5 we have 

4 . OPT+ + 4e 

\\h-f\\i < 2 • OPT* +-(2) 

a — 1 

Proof. We start by analyzing the running time. To this end, we show that the number of intervals 
decreases exponentially with the number of iterations. In the j-th iteration, we merge all but at 
intervals. Therefore, 

Sj — at 3 Si 2 at — s,- 
Sj +1 = at + 2 = X + 4 ' 

Note that the algorithm enters the while loop when Sj > 2 at, implying that 

3s j 

^'+1 < xr 

By construction, the number of intervals is at least at when the algorithm exits the while loop. 
Therefore, the number of iterations of the while loop is at most 

O (log = O (log(l/e) + log log(l/(5)), 

which follows by substituting the value of n from the statement of the theorem. We now show that 
each iteration takes time 0(n). Without loss of generality, assume that we compute the Ai-distance 
only over intervals ending at a data sample. For an interval J = [c, d] containing m sample points, 
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,x m , let Cj = ^ Xj • Xl ^ — + ^ . The „4i-error of / on J is given by maxC,- — min Cj and 

J 71 77 J 

can therefore be computed in time proportional to the number of sample points in the interval. 
Therefore, the total time of the algorithm is 0(n(log(l/e) + log log/l/h))), as claimed. 

We now proceed to bound the learning error. Let X = [Ii ,..., If} be the partition of I returned 
by ConstructHistogram. The desired bound on |X| follows immediately because the algorithm 
terminates only when |X| < 2 at. The rest of the proof is dedicated to Equation (2). 

Fix h* £ Tit such that \\h* — /1|l = OPT*. Let X* = {/*',•... ,//} be the partition induced by 
the discontinuities of h*. Call a point at a boundary of any I* a jump of h* . For any interval 
J C I, we define T(J) to be the number of jumps of h* in the interior of J. Since we draw 
n = Q((at + log l/<5)/e 2 ) samples, Corollary 4 implies that with probability at least 1 — 5, we have 

11 /“/IU ( 2 a+i)t ^ e - 

We condition on this event throughout the analysis. 

We split the total error into three terms based on the final partition X: 

Case 1: Let J- be the set of intervals in X with zero jumps in h*, i.e., T = {J £ X | T(J) = 0}. 

Case 2a: Let be the set of intervals in X that were created in the initial partitioning step of the 
algorithm and which contain a jump of h* , i.e., j7o = {J £ X | T(J) > 0 and J £ Xo}. 

Case 2b: Let J\ be the set of intervals in X that contain at least one jump and were created by 
merging two other intervals, i.e., = { J £ X | T(J) > 0 and J ^ Xq}. 


Notice that X\ Jq, and J\ form a partition of X, and thus 

\\h - /Hi = || h - f\\i,r + || h - f\\ hJo + || h - f Hi,* . 

We will bound these three terms separately. In particular, we will show: 

\\h-f\\i,T < 2 • 11/- h*\\i,jr + \\f - /|U|JT|,^ , (3) 

\\h ~ f\\i,J 0 < \\f-f\\A^Jo, (4) 

ii^-/Ik,x < — a _\ + 2 -n/-^iihx 1 + ii/-/m IJl i + ^ 1 . (5) 

Using these results along with the fact that \\f — h* || pjr + ||/ — /i*||i ) j- 1 < OPT/, we have 

4 . OPT/ + 4e- 

\\h-f\\ 1 <2- OPT/ +- _ V +11 / - / IUi^iX + 11 / - /IU| Jo „Jb + 11/ - 


(a) 

< 2•OPT * + 

(U 

< 2 • OPT/ + 


a — 1 
4 • OPT/ + 4e 
a — 1 

4 • OPT/ + 4e 
a — 1 


+11/ - /IU (2a+ i) t 

+ e ) 


where inequality (a) follows from Fact 6(d) and inequality ( b ) follows from the VC inequality. Thus, 
it suffices to prove Equations (3)-(5). 
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Case 1. We first consider the interval T. By the triangle inequality, we have 


11^ - f\\l,T < 11/ - h*\\l t jr + || h - h*\\l t jr . 

Thus to show (3), it suffices to show that 

II h - h'hf < ||/ - h* 11^ + ||/- f\\ A ^ T . (6) 

We prove a slightly more general version of (6) that holds over all finite sets of intervals not 
containing any jump of h*. We will use this general version also later in our proof. 

Lemma 11. Let J £ jf so that T(J) = 0 for all J £ J. Let h = fj denote the flattening of f on 
J. Then 

\\h-h*\\ ltJ <\\f-h*\\ ltJ + \\f-f\\ Ae j . 

Note that this is indeed a generalization of (6) since for any point x in any interval of J 7 , we 
have h(x ) = fj?(x). 

Proof of Lemma 11. In any interval J £ J with T( J) = 0, we have 

\\h~h* || M ( = } \h(J)-h*(J) | ( = } \f(J)~h*(J)\, 

where (a) follows from the fact that h and h* are constant in J, and (6) follows from the definition 
of h. Thus, we get 

\\h-h*\\ 1 ,j=Y / \\h-h*\\ hJ 

Jej 

Jej 

< El/( J )-/( J )l + El/( J )-^( J )l 

Jej J&J 

< Wf-fWA^j + Wf-hflh'j 

where (c) uses the triangle inequality, and (d) follows from the definition of yl/j-distance. □ 

Case 2a. Next, we analyze the error for the intervals in J7 q. The set T$ contains only singletons 
and intervals with no sample points. By definition, only the intervals in Iq that contain no samples 
may contain a jump of h*. The singleton intervals containing the sample points do not include 
jumps and are hence covered by Case 1. Since the intervals in Jq do not contain any samples, our 
algorithm assigns 

h(J) = f(J) = 0 

for any J £ Jq. Hence, 

11^ - /Hi ,Jo = ll/Hl,Jb • 
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We thus have the following sequence of (in)equalities: 


11^ /111, Jo ~ ll/ll.,* 

= E i/< J )i 

0 

= E i/< J ) - h J )i 

J&Jo 

< ]\f - f\\A m ,Jo 1 

where the last step uses the definition of the A./.--norm. 

Case 2b. Finally, we bound the error for intervals in J \, i.e., intervals that were created by 
merging in some iteration of our algorithm and also contain jumps. As before, our first step is the 
following triangle inequality: 


\\h- fh,j,<\\h-h*\\i,j 1+ \\h* - fhj, . 

Consider an interval J £ J\. Since h is constant in J and h* has T(J) jumps in J, h — h* has 
at most r(J) sign changes in J. Therefore, 

< II h - /|U r(J)+1 ,J + 11/ - /|U r(J)+1 ,J + 11/ - ^*IUr(j)+i,J 

< (r(j) + i)||h - f\\ AuJ + ||/- /|U r(J)+1 ,j + ||/ - h*\\ hJ , (7) 

where equality (a) follows from Fact 6(a), inequality ( b ) is the triangle inequality, and inequality 
(c) uses Fact 6(c). Finally, we bound the Ai-distance in the first term above. 

Lemma 12. For any J E J\, we have 


\\h-f\U,J< 


20PTj + 2e 
(a — 1 )t 


( 8 ) 


Before proving the lemma, we show how to use it to complete Case 2b. Since h is the flattening 
of f over J, we have that || h - f\\Ai,J = e(/> J ). Applying (7) gives: 

ii^-^iii^= E ii^-^iim 


j&ji 


< 


E ((r(j) + i)ii h - f\\ AuJ + ii/- /m r(J)+1 ,j + ii/- h*\\ ltJ ) 


< 


JeJi 

2 • OPT f + 2e 
(a — 1 )t 

(a) 4 .0PT t + 4e 
( a — 1) 


E ( r ( J ) + 1) + E ll/-/IUr(, )+1 ,J+ll/-^lll,A 

J J^>Jl 

+ 11/ - /lUt+ij-q.Ji + ||/ - 
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where inequality (a) uses the fact that T(J) > 1 for these intervals and hence 

£ (IV) +1) < 2 £ r( J) < 2 t. 

JeJ i JeJ i 

We now complete the final step by proving Lemma 12. 

Proof of Lemma 12. Recall that in each iteration of our algorithm, we merge all pairs of intervals 
except those with the at largest errors. Therefore, if two intervals were merged, there were at least 
at other pairs of intervals with larger error. We will use this fact to bound the error on the intervals 
in J\. 

Consider any interval J E J i, and suppose it was created in the j th iteration of the while loop of 
our algorithm, i.e., J = I[ - +1 = for some i E {1,..., Sj/ 2}. Note that this interval is not 

merged again in the remainder of the algorithm. Recall that the intervals /' - +1 , for i E {1,..., Sj/ 2}, 

are the possible candidates for merging at iteration j. Let h' = fr' be the distribution obtained 

i+i 

by flattening the empirical distribution over these candidate intervals Z' +1 = {I[ J+1 , j 2 }. 
Note that h'(x) = h(x) for x E J because J was created in this iteration. 

Let C be the set of candidate intervals I[ J+1 in the set Zj +1 with the largest a-t errors e(/, I[ - +1 ). 
Let Co be the intervals in C that do not contain any jumps of h*. Since h* has at most t jumps, 
|£o| > {oi — l)t. Moreover, for any I' E Co, by the triangle inequality 

e{J,l') = \\ti -f\\ AlJ , 

< || h! - h*\\ AuI , + ||/ - h*\\ AlJ , + ||/ - f\\ Al j> 

< II h' - h*\\ AlJI + ||/ - h*\\ u , + ||/ - f\\ Al j> . 

Summing over the intervals in Co, 

< £ (ll^ - h*\\ Au r + 11/ - /ill,// + 11/ - ZIUi,/') 

I'ec o I'eCo 


i'ec o 


+ II/-/£||i,£o + II/-/IU 2 


< £ ll/i'-^IUi,/' +OPTi + e, (9) 

V'e-Co / 

where recall that we had conditioned on the last term being at most e throughout the analysis. 
Since both h and h* are flat on each interval I' E Co, Lemma 11 gives 

£ II h' - h*\\ AuI , < ||/ - h* || i t Co + 11/- f\\A lCo \Xo < OPT t + e . 

I'eCo 

Plugging this into (9) gives 

£ e(f,l')< 2-OPT t + 2e. 

I'eCo 
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Since J was created by merging in this iteration, we have that e(/, J) is no larger than e(/, I') for 
any of the intervals I' £ Cq (see lines 12 - 15 of Algorithm 1), and therefore e(/, J) is not larger 
than their average. Recalling that |£q| > (a — l)t, we obtain 




II h'-f\\ AlJ 




J2l'£C 0 e (fil') 

(a — 1 )t 


< 


20PT t + 2e 
(a — 1 )t 


completing the proof of the lemma. 


□ 


4.2 The general merging algorithm 

We are now ready to present our general merging algorithm, which is a generalized version of the 
histogram merging algorithm introduced in Section 4.1. The histogram algorithm only uses three 
main properties of histogram hypotheses: (i) The number of intersections between two f-histogram 
hypotheses is bounded by 0(f). (ii) Given an interval J and an empirical distribution /, we can 
efficiently find a good histogram fit to / on this interval, (iii) We can efficiently compute the 
A \-errors of candidate intervals. 

Note that property (i) bounds the complexity of the hypothesis class and leads to a tight sample 
complexity bound while properties (ii) and (iii) are algorithmic ingredients. We can generalize these 
three notions to arbitrary classes of piecewise hypotheses as follows. Let Dbea class of hypotheses. 
Then the generalized variants of properties (i) to (iii) are: (i) The number of intersections between 
any two hypotheses in V is bounded, (ii) Given an interval J and an empirical distribution /, 
we can efficiently find the best fit to / from functions in V with respect to the ^-distance, (iii) 
We can efficiently compute the „4*.-distance between the empirical distribution and any hypothesis 
in V. Using these generalized properties, the histogram merging algorithm naturally extends to 
agnostically learning piecewise hypotheses where each piece is in the class V. 

The following definitions formally describe the aforementioned framework. We first require a mild 
condition on the underlying distribution family: 

Definition 13. Let V be a family of measurable functions defined over subsets of I. V is said to 
be full if for each J C I, there exists a function g in V whose domain is J. Let Vj be the elements 
ofV whose domain is J. 

Our next definition formalizes the notion of piecewise hypothesis whose components come from 
V: 

Definition 14. A function h : I —> R is a t-piece D-function if there exists a partition of I into 
intervals Ii, ... , If with t' <t, such that for every i, 1 < i < t' , there exists hi £ T)j i satisfying that 
h = hi on Li. Let T>t denote the set of all t-piece V-functions. 

The main property we require from our full function class V is that any two functions in V 
intersect a bounded number of times. This is formalized in the definition below: 

Definition 15. Let D be a full family over I and J C I. Suppose h £ Vj and h! £ T>k for some 
k > 1. Let h' = h' Ir , 1 < i < k, for some interval partition I\,..., Ik of I and hf £ T> j t . Let s 
denote the number of endpoints of the Ii’s contained in J. We say that V is d-sign restricted if the 
function h — h! has at most ( s + 1 )d sign changes on J, for any h and h!. 
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The following simple examples illustrate that histograms and more generally piecewise polyno¬ 
mial functions fall into this framework. 

Example 1. LetFLj be the set of constant functions defined on J. Then if hi = UjciTLj, the set Tit 
of t-piece TL-functions is the set of piecewise constant functions on I with at most t interval pieces. 
(Note that this class is the set of t- histograms .) 

Example 2. For J C I, we define Vj 4 to be set of degree-d nonnegative polynomials on J, and 

d e f 

Vd — UjV J4 . Since the degree d will be fixed throughout this paper, we sometimes simply denote 
this set by V. The set Vt 4 °f t-piece V-functions is the set of t-piecewise degree-d non-negative 
polynomials. It is easy to see that this class is full over I. Since any two polynomials of degree d 
intersect at most d times, it is easy to see that Vd forms a d-sign restricted class. 


We are now ready to formally define our general learning problem. Fix positive integers t, d and 
a full d-sign restricted class of functions V. Given sample access to any pdf /:/—>■ R_|_, we want to 

def 

compute a good T>t approximation to /. We define OPT©,* = inf g ^x> t \\d ~ f 111 • Our goal is to find 
an 0(t)-piece "D-function h : I —> R such that \\h — /||i < C ■ OPTjyt + 0(e), with high probability 
over the samples, where C is a universal constant. 

Our iterative merging algorithm takes as input samples from an arbitrary distribution, and 
outputs an 0(t)-piecewise V hypothesis satisfying the above agnostic guarantee. Our algorithm 
assumes the existence of two subroutines, which we call ^.-projection and ^^-computation oracles. 
The ^.-projection oracle was defined in Definition 7 and is restated below along with the definition 
of the ylfc-computation oracle (Definition 16). 

Definition 7. Fix 77 > 0. An algorithm O p (f , J, 77 ) is an ^-approximate ^-projection oracle for 
T> if it takes as input an interval J and f, and returns a hypothesis h £ T> such that 

II h - f\\ Ak < inf \\ti - f\\ Ak ,j + V ■ 


Definition 16. Fix 77 > 0. An algorithm O c (f, hj, J , 77 ) is an ? 7 -approximate ^.-computation oracle 
for V if it takes as input f, a subinterval J C I, and a function hj £ T>j, and returns a value f 
such that 


II hj - f\\ Ak ,j ~ f 


< r] . 


We consider a d-sign restricted full family T>, and a fixed 77 > 0. Let R P (I ) = R p (I,f,O p ) 
and R C (I) = R C (I, f,O c ) be the time used by the oracle O p and O c , respectively. With a slight 
abuse of notation, for a collection of at most 2 n intervals containing n points in the support of the 
empirical distribution, we also define R p (n) and R c {n) to be the maximum time taken by O p and 
O c , respectively. 

We are now ready to state the main theorem of this section: 


Theorem 17. Let O p and O c be if-approximate Au-projection and Ak-computation oracles for 
T>. Algorithm General-Merging (f,t,a,e,6) draws n = 0((adt + log l/d)/e 2 ) samples, runs in 
time O (( R p (n) + R c (n)) log ^), and outputs a hypothesis h and an interval partition T such that 
|Z| < 2 a ■ t. and with probability at least 1 — 5, we have 

\\h — /111 < 3 • OPTx>,t H- — -h 2e + 77 . (10) 

a — 1 


18 



In the remainder of this section, we provide an intuitive explanation of our general merging 
algorithm followed by a detailed pseudocode. 

The algorithm GENERAL-MERGING and its analysis is a generalization of the CONSTRUCTHlS- 
TOGRAM algorithm from the previous subsection. More formally, the algorithm proceeds greedily, 
as before. We take n = 0((adt + logl/5)/e 2 ) samples xi < ... < x n . We construct Zo as in (1). 
In the j-th iteration, given the current partition Xj = ■ ■ ■, Rj,j} with Sj intervals, consider the 

intervals 

1 = he-i,j U hi,j 

for l < Sj/ 2. As for histograms, we want to compute the errors in each of the new intervals 
created. To do this, we first call the M^-projection oracle with k = d + 1 on this interval to find 
the approximately best fit in V for / over these new intervals, namely: 

h 'i,j = Op (j’ 1> ~q/i y 

To compute the error, we call the ^-computation oracle with k = d + 1, i.e.: 

e e,j =O c (f, h£j,l£j +1 , 

As in CONSTRUCTHlSTOGRAM, we keep the intervals with the largest 0(at ) errors intact and 
merge the remaining pairs of intervals. We perform this procedure 0(log/|) times and arrive at 
some final partition Z with 0(at ) pieces. Our output hypothesis is the output of O p (f , I) over each 
of the final intervals I. 

The formal pseudocode for our algorithm is given in Algorithm 2. We assume that V and d are 
known and fixed and are not mentioned explicitly as an input to the algorithm. Note that we run 
the algorithm with r/ = e so that Theorem 17 has an additional 0(e) error. The proof of Theorem 17 
is very similar to that of the histogram merging algorithm and is deferred to Appendix A. 

4.3 Putting everything together 

In Sections 5 and 6.3, we present an efficient approximate Mfc-projection oracle and an Mfc-comput at ion 
oracle for Vd, respectively. We show that: 

Theorem 18. Fix J C [—1,1] and ij > 0. For all k < d, there is an rj-approximate Ak-projection 
oracle for Vd which runs in time 

O ^(d 3 log log 1/?/ + sd 2 + d w+2 ) log 2 - 

where s is the number of samples in the interval J. 

Theorem 19. There is an ij-approximate Ak-computation oracle forVd which runs in time 0((s + 
d) log 2 (s + d)) where s is the number of samples in the interval J. 

The algorithm GENERALMERGING, when used in conjunction with the oracles O p and O c given 
in Theorems 18 and 19 (for ij = e), yields Theorem 1. For this choice of oracles we have that 
R p (n ) + R c (n ) = 0(nd u+2 log 3 1/e). This completes the proof. 
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Algorithm 2 Approximating with general hypotheses by merging, 
l: function General-Merging(/, d, t, a, e, 5) 

2: Draw n = Q((adt + log l/<5)/e 2 ) samples x\ < x 2 < ... < x n . 

3: Form the empirical distribution / from these samples. 

4: Let g- {[a, xi), [xi, xi], (xi, X 2 ),..., (x n _i, x n ), [. x n , x n ], (x n , 6]} be the initial partition. 

5: j <— 0 

6: while \Tj\ > 2a ■ t do 

L Let Xj — {l 1 .j - I : > j , • • •, Isj—\,ji h, ,j } 

8: for £ G {1, 2,..., ^-} do 

9: ^ J 2£-i,j U I 2 e,j 

10: K,j Op(f’I'e,j +D 2 ft) 

11: e bj ^ Oc{f, h' e j,I' £ j + 1 , 2 ^) 

12: end for 

13: Let L be the set of £ G {1, 2,..., with the at largest errors egj. 

14: Let M be the complement of L. 

15: Xj+i •(— IJ {he-i,j, he,j} 

eeL 

16: Tj +1 ■(— Xj+1 U J _|_ 1 | f G M} 

17: j<-j + l 

18: end while 

19: return X = X } and the functions O p (f, J, for J G X 

20: end function 


5 A fast Ak~ project ion oracle for polynomials 

We now turn our attention to the Afc-projection problem, which appears as the main subroutine in 
the general merging algorithm (see Section 4.2). In this section, we let E C J be the set of samples 
drawn from the unknown distribution. To emphasize the dependence of the empirical distribution 
on E, we denote the empirical distribution by /e in this section. Given an interval J = [a, 6] and 
a set of samples E C J, the goal of the A^-projection oracle is to find a hypothesis h G V such 
that the A^-distance between the empirical distribution Je and the hypothesis h is minimized. In 
contrast to the merging algorithm, the A^-projection oracle depends on the underlying hypothesis 
class T>. and here we present an efficient oracle for non-negative polynomials with fixed degree d. In 
particular, our Afc-projection oracle computes the coefficients c G M d+1 of a degree-d polynomial p c 
that approximately minimizes the A^-distance to the given empirical distribution fg in the interval 
J. Moreover, our oracle ensures that p c is non-negative for all x G J. 

At a high-level, we formulate the Afc-projection as a convex optimization problem. A key insight 
is that we can construct an efficient, approximate separation oracle for the set of polynomials that 
have an A/ c -distance of at most r to the empirical distribution /e■ Combining this separation oracle 
with existing convex optimization algorithms allows us to solve the feasibility problem of checking 
whether we can achieve a given A^-distance r. We then convert the feasibility problem to the 
optimization variant via a binary search over r. 

In order to simplify notation, we assume that the interval J is [—1,1] and that the mass of the 
empirical distribution /e is 1. Note that the general Afc-projection problem can easily be converted 
to this special case by shifting and scaling the sample locations and weights before passing them to 
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the ^.-projection subroutine. Similarly, the resulting polynomial can be transformed to the original 
interval and mass of the empirical distribution on this interval. 4 

5.1 The set of feasible polynomials 

For the feasibility problem, we are interested in the set of degree-d polynomials that have an Ak~ 
distance of at most r to the empirical distribution fg on the interval J = [—1,1] and are also 
non-negative on J. More formally, we study the following set. 

Definition 20 (Feasible polynomials). Let E C J be the samples of an empirical distribution with 
/e{J ) = 1. Then the set of (r,d,k, E)-feasible polynomials is 

C r ,d,k,E ■= {c G M d+1 | || p c - ?e\\^j < r and p c (x) > 0 for all x G «/} . 

When d, k, and E are clear from the context, we write only C T for the set of t- feasible polynomials. 

Considering the original ./^-projection problem, we want to find an element c* G C T *, where 
t* is the smallest value for which C T * is non-empty. We solve a slightly relaxed version of this 
problem, i.e., we find an element c for which the ./^-constraint and the non-negativity constraint 
are satisfied up to small additive constants. We then post-process the polynomial p c to make it 
truly non-negative while only increasing the M^-distance by a small amount. 

Note that we can “unwrap” the definition of the Mfc-distance and write C as an intersection 
of sets in which each set enforces the constraint X^i=il Pc(h) ~ //?(C:) I £ t for one collection of 
k disjoint intervals {I\,..., 1^}. For a fixed collection of intervals, we can then write each Akr 
constraint as the intersection of linear constraints in the space of polynomials. Similarly, we can 
write the non-negativity constraint as an intersection of pointwise non-negativity constraints, which 
are again linear constraints in the space of polynomials. This leads us to the following key lemma. 
Note that convexity of C T could be established more directly 5 , but considering C T as an intersection 
of halfspaces illustrates the further development of our algorithm (see also the comments after the 
lemma). 

Lemma 21 (Convexity). The set of t- feasible polynomials is convex. 

technically, this step is actually necessary in order to avoid a running time that depends on the shape of the 
unknown pdf /. Since the pdf / could be supported on a very small interval only, the corresponding polynomial 
approximation could require arbitrarily large coefficients (the empirical distribution would have all samples in a very 
small interval). In that case, operations such as root-finding with good precision could take an arbitrary amount 
of time. In order to circumvent this issue, we make use of the real-RAM model to rescale our samples to [—1,1] 
before processing them further. Combined with the assumption of unit probability mass, this allows us to bound the 
coefficients of candidate polynomials in the current interval. 

5 Norms give rise to convex sets and the set of non-negative polynomials is also convex. 
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Proof. From the definitions of C T and the ^-distance, we have 

C T = {c G M d+1 | || p c — /e\\ 4 fc j < r and p c (x) > 0 for all x G J} 

= {c G M d+1 | sup ^| p c (I) ~ f e(I)\ <7"} H {c G M d+1 |p c (x) > 0 for all x G J} 

JeX 

= p| {c € M d+1 I p c (7) - p(/)| < r} n P {c G R d+1 |p c (x) > 0} 

Xg^fe l£l x£j 

k 

= p| n (ceK d+1 | ^ii{p c {ii) -?E{h)) <t} n Pi{c e M d+1 1 p c (x) > 0}. 
ied k } £e{-i,i} fc i=1 X&J 

In the last line, we used the notation X = {/i,..., /*,}. Since the intersection of a family of convex 
sets is convex, it remains to show that the individual ^-distance sets and non-negativity sets are 
convex. Let 


k 

M = p| n {ce M d+1 I £ tiipcih) - MX)) < r} 

l£3 k j £e{-l,l} fc i =1 

M = p|{c g R d+1 \ p c (x) > 0}. 

x£j 

We start with the non-negativity constraints encoding the set A f. For a fixed x G J, we can 
expand the constraint p c (x) > 0 as 

d 

5>-x* > o , 

i =0 

which is clearly a linear constraint on the Cj. Hence, the set {c G R d+1 | p c (x) > 0} is a halfspace 
for a fixed x and thus also convex. 

Next, we consider the ^-constraints Yli=i £i(Pc(h) ~ Ie(X)) < r for the set Ad. Since the 
intervals I \,..., I\~ are now fixed, so is /e(X)- Let a* and (3i be the endpoints of the interval /*, i.e., 
I, = [ai,/3i\. Then we have 

rPi 

p c (Ii) = / Pc(x) dx = 

J a< 

where P c (x) is the indefinite integral of P c (x). i.e., 

d 

p c (x) = 

i =0 

So for a fixed x, P c {x) is a linear combination of the Cj. Consequently XPi Mc{X) is also a linear 
combination of the c t . and hence each set in the intersection defining Ad is a halfspace. This shows 
that C T is a convex set. □ 

It is worth noting that the set M is a spectrahedron (the feasible set of a semidefinite program) 
because it encodes non-negativity of a univariate polynomial over a fixed interval. After restricting 


PM) - PcM) , 
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the set of coefficients to non-negative polynomials, we can simplify the definition of the ^-distance: 
it suffices to consider sets of intervals with endpoints at the locations of samples (see Lemma 37). 
Hence, we can replace the supremum in the definition of A4 by a maximum over a finite set, which 
shows that C r is also a spectrahedron. This suggests that the ^-projection problem could be solved 
by a black-box application of an SDP solver. However, this would lead to a running time that is 
exponential in k because there are more than (^j) possible sets of intervals. While the authors of 
[CDSS14] introduce an encoding of the ^/-constraint with fewer linear inequalities, their approach 
increases the number of variables in the optimization problem to depend polynomially on -, which 
leads to a super-linear running time. 

Instead of using black-box SDP or LP solvers, we construct an algorithm that exploits additional 
structure in the ^/-projection problem. Most importantly, our algorithm separates the dimension 
of the desired degree-d polynomial from the number of samples (or equivalently, the error parameter 
e). This allows us to achieve a running time that is nearly-linear for a wide range of distributions. 
Interestingly, we can solve our SDP significantly faster than the LP which has been proposed in 
[CDSS14] for the same problem. 

5.2 Separation oracles and approximately feasible polynomials 

In order to work with the large number of M/..-constraints efficiently, we “hide” this complexity from 
the convex optimization procedure by providing access to the constraints only through a separation 
oracle. As we will see in Section 6, we can utilize the structure of the M/-norm and implement such 
a separation oracle for the M/-constraints in nearly-linear time. Before we give the details of our 
separation oracle, we first show how we can solve the ^/-projection problem assuming that we have 
such an oracle. We start by formally defining our notions of separation oracles. 

Definition 22 (Separation oracle). A separation oracle O for the convex set C T is a function that 
takes as input a coefficient vector c £ R rf+1 and returns one of the following two results: 

1. “yes” if c £ C r . 

2. a separating hyperplane y £ R rf+1 . The hyperplane y must satisfy y T c' < y T c for all d £ C T . 

For general polynomials, it is not possible to perform basic operations such as root finding exactly, 
and hence we have to resort to approximate methods. This motivates the following definition of an 
approximate separation oracle. While an approximate separation oracle might accept a point c that 
is not in the set C T , the point c is then guaranteed to be close to C r . 

Definition 23 (Approximate separation oracle). A y-approximate separation oracle O for the set 
C T = C T: d,k,E is a function that takes as input a coefficient vector c £ R d+1 and returns one of the 
following two results, either “yes” or a hyperplane y £ R d+1 . 

1. If O returns “yes”, then \\p c — fE\\^ k j < t + 2y and p c (x) > —y for all x £ J. 

2. If O returns a hyperplane, then y is a separating hyperplane; i.e. the hyperplane y must satisfy 
y T d < y T c for all d £ C T . 

In the first case, we say that p c is a 2y-approximately feasible polynomial. 
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Note that in our definition, separating hyperplanes must still be exact for the set C r . Although 
our membership test is only approximate, the exact hyperplanes allow us to employ several existing 
separation oracle methods for convex optimization. We now formally show that many existing 
methods still provide approximation guarantees when used with our approximate separation oracle. 

Definition 24 (Separation Oracle Method). A separation oracle method (SOM) is an algorithm 
with the following guarantee: let C be a convex set that is contained in a ball of radius 2 L . Moreover, 
let O be a separation oracle for the set C. Then S0M(O, L ) returns one of the following two results: 

1. a point x £ C. 

2. “no” if C does not contain a ball of radius 2~ L . 

We say that an SOM is canonical if it interacts with the separation oracle in the following way: 
the first time the separation oracle returns “yes” for the current query point x, the SOM returns the 
point x as its final answer. 

There are several algorithms satisfying this definition of a separation oracle method, e.g., the 
classical Ellipsoid method [Kha79] and Vaidya’s cutting plane method [Vai89]. Moreover, all of 
these algorithms also satisfy our notion of a canonical separation oracle method. We require this 
technical condition in order to prove that our approximate separation oracles suffice. In particular, 
by a straightforward simulation argument, we have the following: 

Theorem 25. Let M be a canonical separation oracle method, and let O be a p-approximate 
separation oracle for the set C T = C r ^,k,E- Moreover, let L be such that C T is contained in a ball of 
radius 2 L . Then A4(0, L ) returns one of the following two results: 

1. a coefficient vector c £ M d+1 such that \\p c — fE\\j^ k j < r + 2/a and p c (x) > —p for all x £ J■ 

2. “no” if C does not contain a ball of radius 2~ L . 

5.3 Bounds on the radii of enclosing and enclosed balls 

In order to bound the running time of the separation oracle method, we establish bounds on the 
ball radii used in Theorem 25. 

Upper bound When we initialize the separation oracle method, we need a ball of radius 2 L 
that contains the set C T . For this, we require bounds on the coefficients of polynomials which are 
bounded in L\ norm. Bounds of this form were first established by Markov [Mar92], 

Lemma 26. Let p c be a degree-d polynomial with coefficients c £ R rf+1 such that p(x) > 0 for 
x £ [—1,1] and f^ 1 p(x) dx < a, where a > 0. Then we have 

\ci\ < a ■ (d + l) 2 • (\pl + l) d for all i = 0,..., d . 

This lemma is well-known, but for completeness, we include a proof in Appendix B. Using this 
lemma, we obtain: 


24 



Theorem 27 (Upper radius bound). Let r < 1 and let A be the (d+l)-ball of radius r = 2 Lu where 

L u = d log(\/2 + 1) + — log d + 2 . 

Then (-'T,d,k,E ^ A. 

Proof. Let c G Cr.ci.fc.E- From basic properties of the Li- and M/.,-norms, we have 

J ^p c dx = |bc|| 1; j = lbdU fc)J < ll/slU fe .j + ||Pc-/BlU fc) j < 1 + t < 2. 

Since p c is also non-negative on J, we can apply Lemma 26 and get 

|q| < 2 • [d + 1) • {y/2 + l) d for alH = 0,..., d . 

Note that the above constraints define a hypercube B with side length s = 4 ■ (d +1) • (\/2 + l) d . 
The ball A contains the hypercube B because r = \fd + 1 • s is the length of the longest diagonal 
of B. This implies that C Tj d,k,E C B C A. □ 


Lower bound Separation oracle methods typically cannot directly certify that a convex set is 
empty. Instead, they reduce the volume of a set enclosing the feasible region until it reaches a 
certain threshold. We now establish a lower bound on volumes of sets C T +, ? that are feasible by at 
least a margin p in the ^-distance. If the separation oracle method cannot find a small ball in 
Cr+rj, we can conclude that achieving an ^/.-distance of r is infeasible. 


Theorem 28 (Lower radius bound). Let p > 0 and let r be such that C T = C T ,d,k,E is non-empty. 
Then C T + V contains a ball of radius r = 2~ Le , where 


L e = log 


4(d+1) 

p 


Proof. Let c* be the coefficients of a feasible polynomial, i.e., c* G C T . Moreover, let c be such that 


Ci = 



if i = 0 
otherwise 


Since p c * is non-negative on J, we also have p c (x) > | for all x € J. Moreover, it is easy to see that 
shifting the polynomial p c * by | changes the M^-distance to fs by at most ^ because the interval 
J has length 2. Hence, \\p c — fs ||_ 4 fe j < t + ^ and so c G C T + V . We now show that we can perturb 
the coefficients of c slightly and still stay in the set of feasible polynomials Cr+r^. 

Let v = and consider the hyper cube 

B = {d G M d+1 | c( G [cj — is, Ci + i/] for all i} . 
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Note that B contains a ball of radius v = 2 Li . First, we show that p c >(x) > 0 for all x £ J and 
d £ B. We have 


Pc'(x) = 

d d 

= 5>x* + 5^(c- 


— C,- X 


i=0 


i=0 


d 


> Pc(®) - J] 


- 2_^ u \ x I 

> o . 

Next, we turn our attention to the ^-distance constraint. In order to show that pd also achieves 
a good .Afc-distance, we bound the Li-distance to p c . 


\\Pc(x)-Pc'{x)\\ hJ = j \p c {x)-pd{x)\dx 


/•I d 

< / ^ p ■ | a;* | dx 


/-i 

-l 


i =0 


< J (d+l)udx 
= 2(d + l)v 

< 5. 

“ 2 


Therefore, we get 


II Pc' - /eiu fc ,j < \\pc - PclU fc ,j + be - fE\\ Ak ,j 


< 


-P C |ll.J + r + n 


< t + r] . 

This proves that c! G C r +^ and hence B C C r _)_^. 


□ 


5.4 Finding the best polynomial 

We now relate the feasibility problem to our original optimization problem of finding a non-negative 
polynomial with minimal ^-distance. For this, we perform a binary search over the ^.-distance 
and choose our error parameters carefully in order to achieve the desired approximation guarantee. 
See Algorithm 3 for the corresponding pseudocode. 

The main result for our .Afc-oracle is the following: 

Theorem 29. Let p > 0 and let t* be the smallest Ak-distance to the empirical distribution }'e 
achievable with a non-negative degree-d polynomial on the interval J, i.e., t* = min/ ie p Jd ||/r — 
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Algorithm 3 Finding polynomials with small Afc-distance. 
1: function FlNDPoLYNOMlAL(d, k, E, ?y) 

2 : l> Initial definitions 

3 : Let rj = 

4 : Let L u = d log(\/2 + 1 ) + | log d + 2. 

5 : Let L^log 4 ^. 

6 : Let L = max(L n , Lp). 

7: Let At be a canonical separation oracle method. 

8 : Let O t be an ?/-approximate separation oracle 

for the set of (r, d, k, i?)-feasible polynomials. 


n 0 

t u 4— 1 

while t u — T£ > rf do 

t- , 

1 m ~ 2 

r m t— t to + 2rf 

if A4(0 T > n , L) returned a point then 
Tu T m 
else 

<— T m > C T , does not contain a ball of radius and hence C Tm is empty. 

end if 
end while 

d <— M(0 Tu+w ,y, L ) > Find final coefficients, 

co Cq + ?/ and c* c[ for i 7 ^ 0 > Ensure non-negativity. 

return c 
end function 
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fs || 4 fc j- Then FlNDPOLYNOMIAL returns a coefficient vector c E R d+1 such that p c (x) > 0 for all 
x E J and \\p c - ?e\\ a ^j <t* +r). 

Proof. We use the definitions in Algorithm 3. Note that t* is the smallest value for which C T * = 
C T *,d,k,E is non-empty. First, we show that the binary search maintains the following invariants: 
Tg < r* and there exists a ^/-approximately T u -feasible polynomial. This is clearly true at the 
beginning of the algorithm: (i) Trivially, t * > 0 = Tg. (ii) For c = (0,0, • • • , 0) r , we have \\p c — 
fE |[ 4 fc j < 1 = t u and p c (x) > 0 , so p c is Tu-feasible (and hence also approximately r u -feasible). 
Next, we consider the two cases in the while-loop: 

1. If the separation oracle method returns a coefficient vector c such that the polynomial p c is 2rj- 
approximately r^-feasible, then p c is also ^/-approximately r m -feasible because r' m = r m + 2 ?/. 
Hence, the update of t u preserves the loop invariant. 

2. If the separation oracle method returns that C T ' n does not contain a ball of radius 2~ L , then 
T m must be empty (by the contrapositive of Theorem 28). Hence, we have r* > r m and the 
update of Tg preserves the loop invariant. 

We now analyze the final stage of FlNDPOLYNOMIAL after the while-loop. First, we show that 
C Tu _|_ 8 )?' is non-empty by identifying a point in the set. From the loop invariant, we know that there 
is a coefficient vector v' such that p v < is a ^/-approximately r^-feasible polynomial. Consider v with 
vq := v'q + 2 rj and Vi := v' t for i / 0. Then we have 

1 f 1 

| p v ( x ) — p v ' (,x) | d.x = / 2?/ dx = 4 r\ . 

l J -1 

Hence, we also get 

^ (“) ^ (fe) 

\\Pv - fE\\ Ak ,J < \\Pv -Pv'\\ Ak ,J + \\PV - fE\\ Ak ,J < WVv-Pv'W^j + Tu + Ap < T u + 8?/ . 

We used the triangle inequality in (a) and the fact that p v i is dr/'-approximately r u -feasible in (b). 
Moreover, we have p v >(x) > —2?/ for all x E J and thus p v (x) > 0 for all x E J. This shows that 
C Tu + 8 ) 7 ' is non-empty because v E C Tu+ 

Finally, consider the last run of the separation oracle method in line 20 of Algorithm 3. Since 
C Tu .is non-empty, Theorem 28 shows that C Tu+ 10 ^/ contains a ball of radius 2~ L . Hence, the sep¬ 
aration oracle method must return a coefficient vector c' E R d+1 such that p c < is 2?/-approximately 
t u + lOr/'-feasible. Using a similar argument as for v, we can make p c > non-negative while increasing 
its Mfc-distance to fE by only 2 7 /, i.e., we can show that p c (x) > 0 for all x E J and that 

\\Pc-?e\\ a ^j <r n + 14r/ . 

Since t u — Tg < rf and rg < r*, we have t u < t* + rj. Therefore, t u + 14 rj < t* + 1 hrj = r* + 77 , 
which gives the desired bound on ||Pc ~ /e\\ a j- □ 

In order to state a concrete running time, we instantiate our algorithm FlNDPOLYNOMIAL with 
Vaidya’s cutting plane method as the separation oracle method. In particular, Vaidya’s algorithm 
runs in time 0(TdL + d u+l L ) for a feasibility problem in dimension d and ball radii bounds of 2 L 
and 2~ l , respectively. T is the cost of a single call to the separation oracle and oj is the matrix- 
multiplication constant. Then we get: 
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Theorem 30. Let O be an jg-approximate separation oracle that runs in time T. Then FlND- 
Polynomial has time complexity 0((Td 2 + d u+2 ) log 2 ^). 

Proof. The running time of FlNDPOLYNOMIAL is dominated by the binary search. It is easy to 
see that the binary search performs 0 (log iterations, in which the main operation is the call 
to the separation oracle method. Our bounds on the ball radii in Theorems 27 and 28 imply 
L = 0{d + log ^). Combining this with the running time bound for Vaidya’s algorithm gives the 
time complexity stated in the theorem. □ 

In Section 6 we describe a //-approximate separation oracle that runs in time 0{dk+d log log 1 / p+ 
s), where s is the number of samples in the empirical distribution on the interval J. Plugging this 
oracle directly into our algorithm FlNDPOLYNOMIAL gives an //-approximate ^.-projection oracle 
which runs in time 0((d 3 k + d 3 log log 1/?/ + sd? + dP +2 ) log 2 ^). This algorithm is the algorithm 
promised in Theorem 18. 

6 The separation oracle and the *4/,-computation oracle 

In this section, we construct an efficient approximate separation oracle (see Definition 23) for the 
set C T over the interval J = [—1,1], We denote our algorithm by ApproxSepOracle. Let A be 
the ball defined in Lemma 27. We will show: 

Theorem 31. For all p > 0, ApproxSepOracle(c, //) is a p-approximate separation oracle for 
C T that runs in time 0(dk + dloglog^ + s), where s the number of samples in J, assuming all 
queries are contained in the ball A. 

Along the way we also develop an approximate ^-computation oracle ComputeAk. 

6.1 Overview of ApproxSepOracle 

ApproxSepOracle consists of two parts, TestNonnegBounded and AkSeparator. We show: 


Lemma 32. For any r < 2, given a set polynomial coefficients c £ A C M d+1 , the algorithm 
TestNonnegBounded( c, p) runs in time 0(d log 2 d(log 2 d+loglog 1///)) and outputs a separating 
hyperplane for C T or “yes”. Moreover, if there exists a point x G [—1,1] such that p c {x ) < —p, the 
output is always a separating hyperplane. 

We show in the next section that whenever c ^ C T the output is “ yes ”. 

Theorem 33. Given a set of polynomial coefficients c G A C M rf+1 such that p c (x) > —p for all 
x G [—1,1], there is an algorithm AkSeparator (c, p) that runs in time 0(dk+ (. s + d ) log 2 (s + d)) 
and either outputs a separating hyperplane for c from C T or returns “yes”. Moreover, if \\p c — /u||_ 4 fc > 
r + 2 p, the output is always a separating hyperplane. 
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ApproxSepOracle given TestNonnegBounded and AkSeparator Given Test- 
NonnegBounded and AkSeparator, it is straightforward to design ApproxSepOracle. 

We first run TestNonnegBounded(c, p). If it outputs a separating hyperplane, we return 
the hyperplane. Otherwise, we run AkSeparator(c, p), and again if it outputs a separating 
hyperplane, we return it. If none of these happen, we return “yes”. Lemma 32 and Theorem 33 
imply that ApproxSepOracle is correct and runs in the claimed time: 

0 (dlog 2 d(log 2 d + loglog 1///)) + 0(dk + (s + d) log 2 (s + d)) = 0(dk + dloglog 1/p + s) . 

In the following sections, we prove Lemma 32 and Theorem 33. In Section 6.2 we describe 
TestNonnegBounded and prove Lemma 32, and in Section 6.3 we describe AkSeparator and 
prove Theorem 33. 

6.2 Testing non-negativity and boundedness 

Formally, the problem we solve here is the following testing problem: 

Definition 34 (Approximate non-negativity test). An approximate non-negativity tester is an al¬ 
gorithm satisfying the following guarantee. Given a polynomial p = X^f=o c * a '* with maxj|cj| < a 
and a parameter p > 0 , return one of two results: 

• a point x G [—1,1] at which p(x) < — p/2. 

• “OK”. 


Moreover, it must return the first if there exists a point x' G [—1,1] so that p(x') < —p. 

Building upon the classical polynomial root-finding results of [PanOl], we show: 

Theorem 35. Consider p and p from Definition 3f. Then there exists an algorithm TestNonneg(p, p) 
that is an approximate non-negativity tester and runs in time 

0(d log 2 d • (log 2 d + log logo + loglog(l///))) , 

where a is a bound on the coefficients of p. 

This theorem is proved in Section B.2. 

We have a bound on the coefficients c since we may assume that c G A, and so we can use this 
algorithm to efficiently test non-negativity as we require. Our algorithm TestNonnegBounded 
simply runs TestNonneg(p c , p). If this returns ” yes ”, then TestNonnegBounded outputs ” yes ”. 
Otherwise, TestNonneg(p c , p) outputs a point x G [—1,1] such that p c (x ) < —p/2. In that case, 
TestNonnegBounded returns the hyperplane defined by y = — (1, x, x 2 ,..., x d ) T , i.e., p c (x) = 
—y T c. Note that for all d G C T we have p c '{y) > 0 and hence y T d < 0. This shows that 

y T d < 0 < | < -p c {x) = y T c 

as desired. 
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Proof of Lemma 32. The correctness of this algorithm follows from the correctness of TestNonneg. 
We therefore only bound the running time. The worst-case runtime of this algorithm is exactly the 
runtime of TestNonneg(j5 c , /a) for any c € A. Since we run TestNonneg(p c , fi) only when 
max ie[d] \a\ < 2 Lu = 2°^ (see Theorem 27) , the runtime of TestNonneg(p c , pt) is 

0(d log 2 d(log 2 d + log log 1 ///)), 


as claimed. 


□ 


6.3 An Afc-computation oracle 

We now consider the Afc-distance computation between two functions, one of which is a polynomial 
and the other an empirical distribution. In this subsection, we describe an algorithm ComputeAk, 
and show: 


Theorem 36. Given a polynomial p such that p(x) > —p for all x G [—1,1] and an empirical 
distribution f supported on s points, for any k < d, ComputeAk(p, /, k) runs in time 0((s + 
d) log 2 (s + d)), and computes a value v G M+ such that \v — ||p — /||_ 4 fc | < 2/r and a set of intervals 


/],..., A so that 


k 

£ | p(h) - m 

i= 1 


= V . 


Note that this theorem immediately implies Theorem 19. 


AkSeparator given ComputeAk: Before describing ComputeAk, we show how to design 
AkSeparator satisfying Theorem 33 given such a subroutine ComputeAk. 

The algorithm AkSeparator is as follows: we run ComputeAk(j) c , /, k), let v be its estimate 
for || p c — /||Afc) an< i l e t Aj * • • > Ik b e the intervals it produces. If v < r, we output “yes”. 

Otherwise, suppose 

k 

v = £bc(A) - /(A) | > t . 

i =1 

Note that if ||p c — /||.4fe > t + 2/.i, this is guaranteed to happen since v differs from ||p c — f\\_A k by 
at most 2/i. Let dj = sign(p c (A) — /(A))- Let = [aj,6,;]. Then 


£ bc(A) - 7(A) | = £ °i ( / Pc(x) dx - /(A) 
i =1 i =1 ^ ai y 

= £ ^£ ( 6 i +1 - a i +1 ) C - 7(A) j 


and therefore, 

k d k 

£ ^ (£' - a l +1 ) C J > T + £ ^7(A) • (11) 

i=l j=0 J i= 1 

Note that the left hand side is linear in c when we fix dj, and this is the separating hyperplane 
AkSeparator returns in this case. 
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Proof of Theorem 33 given Theorem 36. We first argue about the correctness of the algorithm. If 
|| p c — /||_ 4 fe > r + 2 p, then ComputeAk guarantees that 


v = ^2\Pc(Ii) ~ f{Ii)\ >t . 

i =1 


Consider the hyperplane constructed in (11). For any d £ C T 





1 

JTi 



d 

= E \p*Vi) - 

3=0 


< I Pc' - f\\A k < T , 


where the last inequality is from the definition of C T . Therefore this is indeed a separating hyperplane 
for c and C T . Moreover, given 7),..., 7*. and v, this separating hyperplane can be computed in time 
O(dk). Thus the entire algorithm runs in time 0(dk + (s + d) log 2 (s + d) as claimed. □ 


6.3.1 A Reduction from Continuous to Discrete 


We first show that our A&—computation problem reduces to the following discrete problem: For a 
sequence of real numbers c\,..., c r and an interval I = [a, b] in [r], let w(I) = show 

that our problem reduces to the problem DiscreteAk, defined below. 

DiSCRETEAK: Given a sequence of r real numbers {cj}[ =1 and a number k, find a set of k 
disjoint intervals 1 ),... , A, that maximizes 

k 

Ew)i • 

i= 1 

We will denote the maximum value obtainable ||{ci}||^ fc , i.e., 

ll{c»}IU fc = max^|w(/)| , 

lex 

where the X is taken over all collections of k disjoint intervals. 

We will show that it is possible to reduce the continuous problem of approximately computing 
the Ak distance between p and / to solving DiscreteAk for a suitably chosen sequence of length 
0{d). Suppose the empirical distribution / is supported at s points a < x\ < ... < x s < b in this 
interval. Let X be the support of /. Let p[a,j3] = p(x)dx. Consider the following sequences of 
length 2s + 1: 


m 


1 /n if i is even, 
0 if i is odd. 


j -F ( ii S c(/') 


p[x£,x e+ 1 ] if i = 27 + l, 
0 if i is even. 


where for simplicity we let so = a and = b. The two sequences are displayed in Table 2. 
Then we have the following lemma: 
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i 

1 

2 

3 

4 


2s 

2s + 1 

E(i) 

0 

1 

n 

0 

1 

n 


1 

n 

0 

Pc lisc(*) 

P[a,x i] 

0 

p[x\,X2] 

0 


0 

p[x s ,b] 


Table 2: The sequences E(i) and Pdisc(*)• 


Lemma 37. For any polynomial p so that p(x) > —pi on [—1,1] 


\\p-f\\A k -\\{P^c-E}\\ Ak \ 

Moreover, given k intervals maximizing ||{Pdisc — 

.1 \ . }y. so that 


< 2/j, . 

r)\U k , 


one can compute k intervals 


k 

E | - M) 

i =1 


||{-Pdisc 


e >IIa 


< 2/j 


in time 0{k). 

Proof. We first show that 


\\p-f\\A k >H P di S c-E}\\ Ak . 

Let I\,..., If,- be a set of disjoint intervals in [2d + 1] achieving the maximum on the RHS. Then it 
suffices to demonstrate a set of k disjoint intervals J±,... in / satisfying 

k k 



v{Ji) - f(Ji ) > E l p disc (/i) - E(Ii)\ 


( 12 ) 


i=l 


i =1 


We construct the Jj as follows. Fix i, and let = [aj,6*]. Define J* to be the interval from 
a* to bj. If a,; is even (i.e., if Pdisc(oi) — -E'(z) has only a contribution from —E(ai)), include the 
left endpoint of this interval from «/*, otherwise (i.e., if Tdisc( a i) — E(i) has only a contribution 
from -Pdisc(cq.))) exclude it, and similarly for the right endpoint. Then, by observation, we have 
P c ]i sc (4;) — E(Ii) = p(Ji) — f(Ji ), and thus this choice of Jj satisfies (12), as claimed. 

Now we show the other direction, i.e., that 

\\p-f\\ Ak <\\{P disc -E}\\ Ak +2p.. 

Let I\,..., If,; denote a set of disjoint intervals in / achieving the maximum value on the LHS. 
It suffices to demonstrate a set of k disjoint intervals J], in I satisfying 


E 


p(h) ~ f{h) 


sE 

1=1 


-Pdisc(^i) ~ E(Jj) | + 2 fi 


(13) 


We first claim that we may assume that the endpoints of each are at a point in the support 
of the empirical. Let a* and bj be the left and right endpoints of Ij, respectively. Cluster the 
intervals Ij into groups, as follows: cluster any set of consecutive intervals if it is the 

case that p(Ie) — f(Ie ) > 0 for t = j ,..., j', and [be,a£+i] contains no points of the empirical, for 
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I = j,... , f — 1. Put all other intervals not clustered this way in their own group. That is, cluster a 
set of consecutive intervals if and only if on all of them the contribution to the LHS is non-negative, 
and there are no points of the empirical between them. Let the clustering be X\,.... I/,./, and let J* 
be the smallest interval containing all the intervals in Xj. Let Ci and di denote the left and right 
endpoints of J,;, respectively. Associate to each cluster a sign a % E { — 1,1} which is the (unique) 
sign of p(I ) — /(/) for all I E Xj. Since p(i) > —p, this clustering has the property that for any 
cluster Xi, we have 

<p-\ji~ |j i\ • 

ieij 

Then, for all i. if a, = 1, take the interval I[ = ( Xj,X £) where Xj is the largest point in A so 
that Xj < Ci, and where X£ is the smallest point in X so that xi > dj. Then since p > p on [—1,1] 
and the new interval contains no points in the support of / which are not in U ieiX or Jj, we have 




i iEXi 


p(I') - /(/■) > p(Ji) - /(/•) - M \I[ — Ji\ > 2 p W - m ) U /eX T| - M \I[ - Ji\ • 

Alternatively, if cr, < 0, take the interval /' = \xj,X(\ where Xj is the smallest point in X so that 
Xj > Ci and X£ is the largest point in X so that xi < di. By the analogous reasoning as before we 
have that p(J') -/(/') < p{Ji) - f(Ji) + p\ Ji\, 6 and therefore \p(I[) -/(/')| + p\Ii\ > \p(Ji) - 
Thus, 


k k' 

E \p^ ~ /( 7 <)| < E (b( 7 <) - - u ^ J i + ^ \7 - J *l) 


i=i 


i= 1 
k' 


< E P(A) - ftt) 


+ 2/i . 


2=1 


since (|^i - u /ex^ + p\7~ M) < 2 as the intervals in the sum are disjoint subintervals in 

[- 1 , 1 ]- ^. 

Now it is straightforward to define the Jj. Namely, for each with endpoints x n < Xi 2 so that 
xp,Xi 2 E X, dehne 


[* 1 ,* 2 ] 

[*i + 1,*2] 
[*i ,*2 - 1 ] 

[ii + l,i 2 - 1] 


if x h ,x i2 E Af; 
if xp 0 A and Xj 2 € X; 
if xp E A and xp 0 A; 
if xq, Xj 2 0 A . 


One can check that with this definition of the Jp we have p(Ii) — f{h) = PdiscOl:) moreover, 

all the J* are discrete and thus this choice of Ji satisfies (6.3.1). 

Moreover, the transformation claimed in the lemma is the transformation provided in the first 
part of the argument, ft is clear that this transformation is computable in a single pass through 
the intervals 1). This completes the proof. □ 


6 Since each cluster with negative sign has exactly a single interval in the original partition, notationally we will 
not distinguish between Ji and the one interval in the original partition in Xj, when a t = —1. 
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6.3.2 Description of ComputeDiscreteAk 

For the rest of this section we focus on solving DiscreteAk. A very similar problem was considered 
in [Csu04] who showed an algorithm for the problem of computing the set of k disjoint intervals 
A- maximizing 

k 

!>(*) 

i= 1 

which runs in time 0{d ■ minjlogd, k }) time. We require a modified version of this algorithm which 
we present and analyze below. We call our variant ComputeDiscreteAk. 

Here is an informal description of ComputeDiscreteAk. First, we may assume the original 
sequence is alternating in sign, as otherwise we may merge two consecutive numbers without con¬ 
sequence. We start with the set of intervals Tq = /op < ... < Io. r - where Jo,* = [c,, q] contains only 
the point Cj. We first compute Jo and mo, where Jo is the set of k intervals / in To with largest 
|ic(/)|, and mo = Iteratively, after constructing T, = {i^i,..., we construct 

Zj + 1 by finding the set Ij :] with minimal \w(Ijj)\ amongst all intervals in Z,;, and merging it with 
both of its neighbors (if it is the first or last interval and so only has one neighbor, instead discard 
it), that is, 

Z; | 1 — {li, 1) • • • 11i,j— 2) Ii,j—1 U k.] U k.j- 1 ■ k/ j \ 2■ . • • , Ii,n\ ■ 

We then compute Ji+\ and m;.|_i where Ji+i is the set of k intervals I in Z,;+i with largest |u>(/)|, 

and ruj+i = x |w(/)|. To perform these operations efficiently, we store the weights of the 

intervals we create in priority queues. We repeat this process until the collection of intervals Z^ 

has < k intervals. We output Ji and Wi, where Wi is the largest amongst all computed in any 
iteration. An example of an iteration of the algorithm is given in Figure 1, and the formal definition 
of the algorithm is in Algorithm 4. 

0.8 -0.5 0.4 -0.1 0.3-0.4 0.5 

Iteration i : \ -1-1--1-1-—|-1-1 

0.8 -0.5 0.6 -0.4 0.5 

Iteration i + 1 : |-1-1--1-1-1 


Figure 1: An iteration of ComputeDiscreteAk. The numbers denote the weight of each interval. 
The interval with smallest weight (in absolute value) is chosen and merged with adjacent intervals. 
Note that if weights are of alternating signs at the start, then they are of alternating signs at each 
iteration. 

The following runtime bound can be easily verified: 

Theorem 38. Given {cj}[ =1 , COMPUTEDisCRETEAK({cj}[ =1 , k) runs in time 0(r ■ minjlogr, k}). 
The nontrivial part of the analysis is correctness. 

Theorem 39. Given {cj}l =1 and k, the set of intervals returned by the algorithm COMPUTE- 
DiscreteAk({q}1 =1 , k) solves the problem DiscreteAk. 
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Algorithm 4 Computing the discrete Ak norm of a sequence. 

1: function ComputeDiscreteAk({cj}[ =1 , k) 

2: Let Zo {[ci, ci], [c 2 1 C 2 ],..., [c r , c r }} be the initial set of intervals. 

3: Let Q be an empty priority queue. 

4: for I G Zo do 

5: Q.push(I,w(I )) 

6: end for 

7: * <- 0 

8: while |Z, | > k do 

9: Let I Q.deleteMinQ. 

10: if I is not the leftmost or rightmost interval then 

11: Let 1^ft and Irigh.t. be its left and right neighbors, respectively. 

12 : Q.remove(Ii e ft) 

13: Q.remove(I r ig ht ) 

14: Let I = Ileft U / U Iright 

15: Q.push(I',w(I')) 

16: end if 

17: i-k-i + l 

18: Let Z; be the elements of Q 

19: Let Ji be the k intervals in X, with maximum weight 

20: Let Wi = W (I) 

21: end while 

22: return Wj and Jj where Wj satisfies Wj > Wi for all i. 

23: end function 


Proof. Our analysis follows the analysis in [Csu04], We call any Z* which attains the maximum 
for the DlSCRETEAK problem a maximal subset, or maximal for short. For any two collections of 
disjoint intervals X', X" in [r], we say that X' is contained in X" if all the boundary points of intervals 
in X' are also boundary points of intervals in X". Figure 2 shows an example of two collections of 
intervals, one contained in the other. If there is a maximal Z* that is contained in Z we say that 
Z contains a maximal subset. We say that X' is atomic with respect to X" if every interval in X' is 
also in X". Figure 3 gives an example of two collections of intervals, one atomic with respect to the 
other. If there is a maximal Z* that is atomic with respect to Z then we say that the maximum is 
atomic with respect to Z. 

X' : |-1-1 I-1 |-1 

X" : I - H - 1 - 1 H - 1 I - H 

Figure 2: X' is contained in X" since each boundary point of all intervals in X' are boundary points 
of some interval in X". 

We will prove the following invariant of our algorithm: 
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X' : I-1-1 I-1 I-1 

X" : |-1-1-1 I-1 I-H 

Figure 3: X' is atomic with respect to X ", since each interval in X' is also an interval in X". 


Lemma 40. For any i > 0, if X % contains a maximal subset, then either the maximum is atomic 
with respect to Xj or Xj + i contains a maximal subset. 

Before we prove this lemma, let us see how it suffices to prove Theorem 39. Now the set Xq 
contains a maximal subset. By induction and Lemma 40, for all i, as long as the maximum is not 
atomic with respect to Xj, Xj+i contains a maximal subset. ComputeDiscreteAk stops iterating 
at iteration if if X, f has at most k intervals. At this point either the maximum was atomic with 
respect to some X,;, or Xi f contains a maximal subset. Let X* be any maximal subset it contains. 
We observe that 

Y M-01 ^ Y M J )I ’ 

lex* IeXi f 

and moreover, X lf has k pieces, so X, f is itself maximal, and is atomic with respect to itself. 

Thus, there is some i so that X, contains a maximal subset that is atomic with respect to X,;. 
Call this maximal subset X*. But then since it is atomic with respect to X,;, we have that 

Y K-01 - Y M J )I = mi ’ 

lei* ieji 

since Ji is chosen to maximize the sum over all sets of k intervals which are atomic with respect to 
X*. Since X* achieves the maximum for DiscreteAk, we conclude that m.j is indeed the maximum. 
Thus whatever rrijj we output is also the maximum, and its Jp attains the maximum. This completes 
the proof of Theorem 39 assuming Lemma 40. □ 

We now prove Lemma 40. 

Proof of Lemma fO. It suffices to show that if Xj contains a maximal subset, but the maximum is 
not atomic with respect to Xj, then Xj+i also contains a maximal subset. Thus, let X* be such that 

1. X* is maximal 

2. X* is contained in Xj, and 

3. there is no X\ ^ X* satisfying conditions (1) and (2) so that every interval in X\ is contained 
in some interval in X*. 

Such an X* clearly exists by the assumption on Xj. Note that X* cannot be atomic with respect to 
Xj. By observation we may assume that no interval V in a maximal subset will ever end on a point 
a so that w(I') and c a have different signs, since otherwise we can easily modify the partition to 
have this property while still maintaining properties (l)-(3). More generally, we may assume there 
does not exist an interval I" contained in I’ with right endpoint equal to I h s right endpoint (resp. 
left endpoint equal to I h s left endpoint) so that w(I') and w(I") have different signs. 
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Let Ij = [ 02 , fol be the interval in I t with minimal |w(/)| amongst all I £ X,;. WLOG assume 
that it is not the leftmost or rightmost interval (the analysis for these cases is almost identical and 
so we omit it). Let (3 1 be the left endpoint of Ij~\ and /?4 be the right endpoint of Ij+i- WLOG 
assume that w{Ij) < 0. 

The initial partition Iq had the property that the signs of the values w(I) for I £ Xd alternated, 
and through a simple inductive argument it follows that for all X;, the signs of the values w(I) 
for I £ X; still alternate. Thus, we have w(Ij-i), w(Ij+i) > 0. Since X* is not atomic, there 
is some I a £ X* which contains at least two intervals Ii,I -2 of X,;. Moreover, since the signs of 
the w(I) of the intervals in X; alternate, we may assume that w(Ii) and wfa) have different 
signs. Thus, by observation, we may in fact assume that I a contains three consecutive intervals 
I\ <I ‘2 < I 3 , and that w(I a ),w(Ii),w(l 3 ) have the same sign, and u>(/ 2 ) has a different sign. 
Moreover, |xc(Xy)| < min{r(;(/i), tc(/ 2 ), w(l 3 )}. Moreover, define l\ to be the interval which shares 
a left endpoint with I a , and which has right endpoint the right endpoint of I\, and /)( to be the 
interval which shares a right endpoint with I a , and which has left endpoint the left endpoint of I 3 
(See Figure 4). 


X* : 

X: : 


h 


"tT 


h h h 

u u 


"lT 


J 


Figure 4: The interval I a , and the intervals I\, I 2 , and I 3 . 

We must have that w(I^),w(I^) are the same sign as w(I a ) as otherwise, say if w(ll) 1 s sign was 
different from tc(/ a )’s sign, we would have \w(I a )\ < \w(I^)\ and so the existence of the collection 
of intervals I' = (X* \ I a ) U I% violates condition (3), since it is contained in X;, and 

i w ( 7 )i - i w ( J «)i + \ w (0\ > i w ( 7 )i > 

lei' lei* lei* 

so it is maximal. 

Since X* is contained in X,;, the only boundary points that intervals in X* can have in the interval 
[/?!,/? 4 ] are at the points /?* for i £ {1, 2, 3,4}. There are a few cases. 

Case 1 If no interval in X* has any boundary point at /?2 or /? 3 , then it is still contained in X+i, 
by the definition of X+i- 

Case 2 If [&, fJ 3 \ £ X*, define X' = (X* \ {[/3 2 , p 3 ],I a }) U { 4 1 , 1 2 a }. Then 

K 7 )i = i w ’( 7 ^ _ I w ( 7 j)i + K t ’( / 2)i > i w ( 7 )i 

/ex' /ex* /ex* 

by the choice of Ij, so T l is maximal, contained in X,;, and thus its existence violates condition (3), 
so this case is impossible. This is illustrated in Figure 5, where for simplicity I a contains precisely 
three intervals. 
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X* : 

T- ■ 1 

1 1 

1 1 1 

1 1 

II II 

1 1 r 

1 1 1 

Xj . j 

II II 

I\ h II 

1 1 ^ 

P2 ij Ps 


Figure 5: When Ij is an element of X*, we can drop it and add the intervals / f ] and 1% achieving a 
larger weight. 

Case 3 If X 3 is the right endpoint of some interval I 6 X*, then by the same reasoning as before, 
we may assume that w(I) < 0. Then, let X be the interval with the same left endpoint as I but with 
right endpoint f3 1. Since then w(I') = w(I) — w(Ij- 1) — w(Ij) < w(I), the partition X' = X* \ I U X 
is maximal, contained in Xj, and its existence again violates condition (3), so this case is impossible. 
An illustration is given in Figure 6. 


X ij -1 h 

£ = I-H-1-h 

Pi P 2 Pd 


H •• h 


+ ■■■ + 


H—I 


Figure 6: X* can drop I and instead take X to get a larger weight. 


Case 4 If p 2 is the left endpoint of some interval I E X*, then analogous reasoning to that in Case 
3 results in a contradiction, so this case is also impossible. 

Case 5a If P 2 is the right endpoint of some interval / £ X*, and no interval in X* contains Ij+i, 
then we know that w(I) > 0. Let X be the interval I U Ij U Ij+i■ Then, the partition T' = X* \/uX 
is maximal by the same kind of reasoning as before, and X' is contained in X 7+1 • Thus, this case is 
possible and consistent with the Lemma. 

Case 5b If /?2 is the right endpoint of some interval I G X* and /?3 is the left endpoint of 
some interval X G X*, then we know that w(I),w(I') > 0. Let I" = I U Ij U X. Define X 1 = 
(X* \ {XX, la}) U {X', Then, 

X] M J )I = X H 7 )l _ M J i)l + M j 2)I > X H 7 )l > 

/ex' /ex* /ex* 

so again this is a maximal subset which is now contained in X ]+ \. 

Case 6 If X 3 is the left endpoint of some interval in X*, by analogous reasoning to that in Cases 
5a and 5b, we may conclude that in this case, the Lemma holds. 

These cases encompass all possible cases, and thus we conclude that the Lemma holds, as 
claimed. □ 


39 



6.3.3 Description of ComputeAk 

Our algorithm ComputeAk uses Fact 41 below, produces the sequence P,a sc (i)—E(i), and computes 
||{P disc - £}|U fc using ComputeDiscreteAk. 

It thus suffices to show that we can construct this sequence Pdisc(*) ~ E(i) efficiently when we 
are given the empirical distribution and the polynomial p. The only difficulty lies in efficiently 
computing the p[xj,Xj+i]. This problem is equivalent to efficiently evaluating the integral of p at 
all the points in A, which is in turn equivalent to efficiently evaluating a degree d + 1 polynomial 
at s points. To do so, we use the following well-known fact: 

Fact 41 ([VZGG13], p. 299 and p. 245). Let x \,... ,x s be a set of s real numbers and let p be a 
polynomial of degree at most s. Then there is an algorithm that computes p{x i), ... ,p(x s ) in time 
0 (slog 2 s). 

After solving the discretized version of the problem, the algorithm outputs the estimate that 
COMPUTEDISCRETEAK computed and the processed version of the intervals, where the processing 
is the one described in Lemma 37. Thus, we have: 

Proof of Theorem 36. The correctness of the algorithm follows from Lemma 37 and the arguments 
given above. Thus it suffices to bound the running time. The time required to produce the sequence 
C’disc(^) — E(i) is bounded by computing the p[xi, Xi+ 1 ], which can be done in time 0((s + d) log 2 (s + 
d)) by Fact 41. Moreover, the running time of ComputeDiscreteAk on the sequence -Pdisc(*)~ E(i) 
is 0(s log s). Hence, the running time of the overall algorithm is 0((s+c?) log 2 (s+d)), as claimed. □ 

7 Applications 

In this section, we apply our main result to obtain near optimal estimators for various classes of 
structured distributions. As described in Table 1, we consider arbitrary mixtures of well-studied dis¬ 
tribution families, including log-concave distributions, normal distributions, densities with bounded 
number of modes, and density functions in Besov spaces. We also consider mixtures of discrete struc¬ 
tured distributions over an ordered domain, such as multi-modal distributions, monotone hazard 
rate distributions, Poisson, and Binomial distributions. For all these classes, our sample complexity 
and running time match the information-theoretic optimum, up to at most logarithmic factors. 

We note that even though our algorithm is stated for distributions over a known finite interval, 
they are also applicable to distributions over the entire real line, such as (mixtures of) Gaussians or 
Poisson distributions. This follows from the following fact: let x m j n and x max be the smallest and 
largest elements among log ^^ draws from any distribution. Then with probability at least 1 — 5, 
the distribution assigns probability mass at least 1 — e to the interval [x' m ; n , x max ]. Thus, at a cost of 
l°g(V<b sam pl eg) we may truncate the distribution and thereafter only consider this finite interval. 

7.1 Mixture of log-concave distributions 

For an interval I C M, a function g : I M is called concave if for any x, y £ I and A £ [0,1] 
it holds g (Ax + (1 — A )y) > A g(x) + (1 — A )g(y). A function h : I —> M+ is called log-concave if 
h(x) = exp (g(x)), where g : I —> M is concave. A density / is a fc-mixture of log-concave density 
functions if there exist w\, ... ,Wk > 0, = 1 and log-concave density functions /i, •••,/& 

such that / = ^2 mfi. The class of log concave distributions is very broad and contains the 
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class of Gaussians, uniform, exponential, Gamma, Beta, and Weibull distributions. Log-concave 
distributions have received significant interest in economics and statistics [BB05, CSS10, DR09, 
DW13, CS13, KS14, BD14, HW15], 

It was shown in [CDSS14a] that a ^-mixture of log-concave density functions can be e-approximated 
in Li-norm by a f-piecewise linear density, for t = 0(k/y/e). Using this structural result, [CDSS14a] 
gave a polynomial time algorithm with sample complexity 0(t/e 2 ) = 0(k/e 5//2 ) to agnostically learn 
a fc-mixture of log-concave distributions. This sample bound is nearly optimal, as D(/c/e 5//2 ) samples 
are necessary for this learning problem. 

Our main result yields a sample optimal and nearly-linear time algorithm for this problem. In 
particular, this follows from a combination of Theorem 1 and a recently obtained tight structural 
result that removes the logarithmic factors from the previous construction of [CDSS14a]. In partic¬ 
ular, it is shown in [DK15] that a ^-mixture of log-concave density functions can be e-approximated 
in Li-norm by a f-piecewise linear density, for t = 0(k/\/e). As a corollary, we obtain the following: 


Theorem 42. There is an agnostic learning algorithm for the class of k-mixtures of log-concave 
distributions over the real line that uses 0[k/e 5 / 2 ) samples and runs in time 0{{k/e 5 / 2 )). 

7.2 Mixture of Gaussians 

Let N (fi, a 2 ) denote the normal distribution with mean fi and variance o 2 . A density / : R —> M + is a 
Admixture of Gaussians if there exist w±,.. ., Wk > 0, Wi = 1, n±, ..., £l, and cri, ..., <7*. 6 R+ 

such that / = Ylt=\ a 2 ). 

In the theoretical computer science community, the problem of parameter estimation for Gaus¬ 
sian mixtures was initiated by [Das99]. Recent work has obtained polynomial sample and time 
algorithms for this problem under the conditions of identihability [MV10, BS10]. We remark that 
learning the parameters of a mixture of k univariate Gaussians to accuracy e requires U((l/e) 6fc_2 ) 
samples [HP 15] . 

The problem of proper learning for Gaussian mixtures has also been recently studied in [DK14, 
SOAJ14] who obtain algorithms that draw 0(k/e 2 ) samples and run in time 0(( l/e) 3 ^ 1 ). Another 
approach, due to [BSZ15], outputs a mixture of 0(k/e 3 ) Gaussians in time and sample complexity 
of 0(k/e e ). 

It is well-known (see, e.g., [Tim63, Section 7.21] or [CDSS14a]) that a normal distribution is 
e-close to a 3-piecewise polynomial of degree 0(log(l/e)). Using this structural result, [CDSS14a] 
obtain a nearly sample optimal and polynomial time agnostic learning algorithm for this problem. 

As a corollary of Theorem 1, we obtain a nearly sample optimal and nearly-linear time algorithm. 
(The sample complexity of our algorithm is better than that of [CDSS14a] by logarithmic factors.) 
In particular: 

Theorem 43. There is an agnostic learning algorithm for k-mixtures of univariate Gaussians that 
draws 0((k/e 2 ) log(l/e)) samples and runs in time 0(k/e 2 ). 

7.3 Densities in Besov spaces 

Densities in Besov spaces constitute a broad family of distributions, including piecewise polynomials 
and the exponential family. Density estimation for functions in Besov spaces has received consid¬ 
erable attention in the statistics and information theory literature. A lot of the early work on the 
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topic relied on wavelet techniques, based on the fact that functions in Besov spaces are amenable 
to multiscale decompositions [DeV98, DJKP96, DJ98]. 

A piecewise smooth density function / has the following decomposition, 

OO 

f ( x ) = ^2 c jo,k<l>j 0 ,k{ x ) + Y Y d i°^jo,k{x) 

k j=j 0 k 


where the </>’s are scaling functions and the ^’s are wavelet functions. The Besov space B^(L p ([ 0,1])) 
is the following subset of such density functions 


B a JL p {[ 0,l])) d = <; 


,/r\ V« 


f-\ l%,ilk+ | E **£i«w 

J=jo \ k 


< OO 


for parameters a > ^ > 0 and q > 0 , where {cj 0 ,k} and are the scaling and wavelet coefficients 

in the wavelet expansion of /. 

Nowak and Willett [WN07] showed that any density / in _B“(L p ([0,1])) for 0 < q < p, with 
t = a + 7), can be approximated up to L\ error e with n = O a ^ ) samples. They also 

propose an algorithm for this problem with running time ff(n 3 ). 

As a corollary of our main result, we obtain a sample optimal and nearly-linear time agnostic 
algorithm for this problem. A result in [DeV98] implies that under the above assumptions on 
a,p,q, any function in i?“(L p ([ 0 , 1 ])) can be e-approximated in Li-norm by an O a (e~ 1 ^ a )- piece 
degree-0( [a]) polynomial. Combined with our main result, we obtain an algorithm with sample 

complexity O a ^ 2 + 1 /n ), which is optimal up to constant factors [WN07]. Moreover, the running 
time of our algorithm is nearly-linear in the number of samples. In particular: 

Theorem 44. There is an agnostic learning algorithm for 0, 1])), with 0 < q < p, 1/p = 

a + 1/2 with sample complexity O a f e2 +i/a ) an d running time O a f 2 +i/a ) • 


7.4 Mixtures of t-monotone distributions 

A density / : M —> M_)_ is 1-monotone if it is non-increasing. It is 2-monotone if it is non-increasing 
and convex, and t-monotone for t > 3 if (—1)-?/0 is non-negative, non-increasing, and convex for 
j = 0,..., t — 2. A number of recent works in statistics studied the problem of estimating f-monotone 
density functions in the context of the MLE [BW07, GW09, BW10]. 

Implicit in [KL04, KL07] is the fact that any f-monotone bounded density function over [0,1] 
can be approximated with an 0(1/e 1 ^) piecewise degree f — 1 polynomial. Using this along with 
our main result yields the following guarantee on learning f-monotone distributions. 

Theorem 45. There exists an agnostic learning algorithm for k-mixtures of t-monotone distribu¬ 
tions that uses 0(tk/e 2+1 ^) samples and runs in time 0(kt 2+UJ /e 2+l ^). 

The above is a significant improvement in the running time compared to [CDSS14a], Note that 
for f = 1,2, the sample complexity of our algorithm is optimal. This follows from known lower 
bounds of fl(l/e 3 ) for f = 1 [Bir87a] and of U( 1/e 5 / 2 ) for f = 2 [DL01]. 
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7.5 Mixtures of discrete distributions 

Our main result applies to the discrete setting as well, leading to fast algorithms for learning mixtures 
of discrete distributions that can be well-approximated by piecewise polynomials. 

Mixtures of f-modal discrete distributions and MHR distributions. A distribution over 
[A] is unimodal if there is a j £ [A] such that the pmf is non-decreasing up to j, and non-increasing 
after j. A distribution is f-modal if there is a partition of [A] into at most f intervals over which 
the conditional pmf is unimodal. It follows from [Bir87b, CDSS13] that any mixture of k f-modal 
distributions is e-close to a (kt/e) log(A/fei)-histogram. [CDSS14b] implies an algorithm for this 
problem that uses n = 0(kt log(A)/e 3 ) samples and runs in time 0(n). As a corollary of our main 
result, we obtain the first sample optimal (up to constant factors) and nearly-linear time algorithm: 


Theorem 46. There is an agnostic learning algorithm for k-mixtures of t-modal distributions over 
[A"] that draws 0( ktl °sW kt ) ^ samples and runs in time Q^ ktlo s(^/ kt ) l 0 g(]y e )). 

We similarly obtain a sample optimal and near-linear time algorithm for learning mixtures of 
MHR distributions. 


For a distribution p on [A], the function H(i) — 


def p(i) 


is called the hazard rate function of 

2lj>iPW 

p. The distribution p is a monotone hazard distribution (MHR) if H(i ) is non-decreasing. [CDSS13] 
shows that a mixture of k MHR distributions over [A] can be approximated up to distance e 
using an 0(k log(A/e)/e)-histogram. Using this, [CDSS14b] yields a 0(fclog(A/e)/e 3 ) sample, 
0(fclog(A/e)/e 3 ) time algorithm to estimate mixtures of MHR distributions. We obtain 


Theorem 47. There is an agnostic learning algorithm for k-mixtures of MHR distributions over 
[A] that draws 0(AHog(A/e)/e 3 ) samples and runs in time O ( kl ° s [^ c 1 l Q g( 1/e )). 

Mixtures of Binomial and Poisson distributions. We consider mixtures of k Binomial and 
Poisson distributions. For these distribution families, the best sample complexity attainable using 
the techniques of [CDSS14a, CDSS14b] is 0{k/e 3 ). This follows from the fact that approximating a 
^-mixture of Binomial or Poisson distributions by piecewise constant distributions requires Q(k/e) 
pieces. 

A recent result of [DDS15] shows that any Binomial or Poisson distribution can be approximated 
to Li distance e using f-piecewise degree-d polynomials for f = 0(1) and d = 0(log(l/e)). Therefore, 
a Binomial or Poisson /e-mixture can be approximated with 0(/c)-piecewise, degree-0(log(l/e)) 
polynomials. Since our main result applies to discrete piecewise polynomials as well, we obtain the 
following: 


Theorem 48. There is an agnostic learning algorithm for k-mixtures of Binomial or Poisson dis¬ 
tributions that uses 0{\ log(l/e)) samples and runs in time 0(k/e 2 ). 


8 Experimental Evaluation 

In addition to the strong theoretical guarantees proved in the previous sections, our algorithm also 
demonstrates very good performance in practice. In order to evaluate the empirical performance of 
our algorithm, we conduct several experiments on synthetic data. We remark that the evaluation 
here is preliminary, and we postpone a more detailed experimental study, including a comparison 
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with related algorithms, to future work. Nevertheless, our results here show that both the empirical 
sample and time complexity are nearly optimal in a strong sense. For example, no histogram 
learning algorithm that requires sorted samples can outperform the running time of our method by 
more than 30%. Similarly, our learning algorithm for piecewise linear hypotheses only adds a factor 
of 2 — 3x overhead to the time needed to sort the samples. Moreover, the sample complexity of our 
algorithm matches the quantity t ■ (d+ l)/e 2 up to a small constant between 1 and 2. 

All experiments in this section were conducted on a laptop computer from 2010, using an Intel 
Core i7 CPU with 2.66 GHz clock frequency, 4 MB of cache, and 8 GB of RAM. We used Mac OS X 
10.9 as operating system and g++ 4.8 as compiler with the -03 flag (we implemented our algorithms 
in C+-(-)• All reported running times and learning errors are averaged over 100 independent trials. 
As an illustrative baseline, sorting 10 6 double-precision floating point numbers with the std::sort 
algorithm from the C++ STL takes about 100 ms on the above machine. 

Figure 7 shows the three distributions we used in our experiments: a mixture of two Gaussians, 
a mixture of two Beta distributions, and a mixture of two Gamma distributions. The three distri¬ 
butions have different shapes (e.g., different numbers of modes), and the support size considered for 
these distributions differs. 





Gaussian mixture density 


Beta mixture density 


Gamma mixture density 


Figure 7: The three test distributions. 


8.1 Histogram hypotheses 

In order to evaluate our histogram learning algorithm (see Section 4.1), we use the following test 
setup. For a given unknown distribution with pdf /, we draw n i.i.d. samples from the unknown 
distribution. We then give the sorted samples as input to our algorithm, which produces a histogram 
hypothesis h. We set the parameters of our algorithm so that the resulting histogram contains 80 
constant pieces. As performance measures, we record the running time of our algorithm (excluding 
sorting) and the Li-learning error achieved, i.e., ||/ — h\\ v 

Figure 8 contains the running time results, both on a linear scale and on a logarithmic scale. The 
results indicate three important points: (i) The running time of our algorithm scales nearly-linearly 
with the input size, i.e., the number of samples n. (ii) The constant hidden in the big-O notation 
of our analysis is very small. In particular, the algorithm runs in less than 35 milliseconds for 10 6 
samples. Note that this is three times faster than sorting the samples, (iii) The running time of our 
algorithm essentially does not depend on the unknown distribution. Such robustness guarantees are 
very desirable for reliable performance in practice. 

The Li-learning error results are displayed in Figure 9. The results show that the best learning 
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error achievable with 80-piece histograms depends on the shape of the underlying distribution: 2- 
GMMs are harder to approximate than the Beta and Gamma mixtures. This shows that for large 
number of samples, it is beneficial to use richer hypotheses classes such as piecewise linear functions 
(see the next subsection). Nevertheless, our algorithm exhibits a good decay of the learning error 
before the regime where OPTso dominates. 



0 2-10 5 4-10 5 6-10 s 8-10 5 MO 6 

Number of samples 



Number of samples 


Figure 8: Running times for density estimation with histogram hypotheses. The left plot shows the 
results on a linear scale, the right plot on a logarithmic scale. As predicted by our analysis, the 
running time of our algorithm scales nearly-linearly with the input size n. Moreover, the constant 
in the big-0 is very small: for n = 10 6 , our algorithm takes less than 35 milliseconds, which is about 
three times faster than sorting the samples. The running time performance of our algorithm is also 
essentially independent of the unknown distribution. 


8.2 Piecewise linear hypotheses 

Next, we turn our attention to the more challenging case of agnostically learning piecewise linear 
densities. This is an interesting case because, in contrast to the histogram algorithm, the piecewise 
linear algorithm requires our full set of tools developed in Sections 3-6. For the case of piecewise 
linear functions, the structure of the feasible set is still somewhat simpler than for general degree-d 
polynomials because the non-negativity constraint on a given interval can be encoded with two 
linear inequalities, i.e., the feasible set is a polytope instead of a spectrahedron. We use this 
additional structure in our piecewise linear algorithm. However, we did not implement further 
potential optimizations and resorted to an off-the-shelf linear program (LP) solver (GLPK, the 
GNU Linear Programming Kit) instead of a customized LP solver. We believe that the running 
time of our algorithm can be improved further by implementing a custom LP solver that better 
utilizes the structure and small size of our LPs (and also takes into account that we solve many 
such small LPs). 

We repeat the same experimental procedure as for piecewise histogram hypotheses, but use 40 
linear pieces this time. Figure 10 contains the running time results of our algorithm. Again, the 
results show three important points: (i) As predicted, the running time scales nearly-linearly with 
n. (ii) In spite of using an off-the-shelf LP solver, the constant factor in our running time is still 
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Figure 9: Learning error for density estimation with histogram hypotheses. The left plot shows 
the results on a linear scale, the right plot on a logarithmic scale. The results clearly show that 
some distributions such as 2-GMMs are harder to approximate with 80-piecewise constant hypothe¬ 
ses than others. Before the optimal learning error OPTso dominates, our algorithm nevertheless 
demonstrates a quickly diminishing learning error. 


good. In particular, our algorithm requires less than 0.3 seconds for 10 6 samples. This is only three 
times slower than the time required for sorting the samples. We believe that with a customized LP 
solver, we can bring this overhead down to a factor closer to two. (iii) Again, the running time of 
our algorithm is very robust and does not depend on the shape of the unknown distribution. 

Next, we consider the learning error achieved by our piecewise-linear algorithm, which is dis¬ 
played in Figure 11. Compared with the plots for piecewise constant hypotheses above, the results 
show that piecewise linear hypotheses can approximate the unknown densities significantly better, 
especially for the case of the 2-GMM. Three points are worth noting: (i) The slope of the curve 
in the log-scale plot is about —0.477. Note that this matches the \ term in our learning error 
guarantee q( *'G+ 1 ) ^ a l irlos t perfectly, (ii) Moreover, the constant factor achieved by our algorithm 
is close to 1. In particular, the learning error for the 2-GMM and n = 10 6 samples is roughly 
0.00983. Using this as e = 0.00983 together with t = 40 and d = 1 in gives about 830,000, 

which almost matches the n = 10 6 samples for which this error was obtained, (iii) The learning 
error of our algorithm is robust and essentially independent of the underlying distribution. 
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Figure 10: Running times for density estimation with piecewise-linear hypotheses. The left plot 
shows the results on a linear scale, the right plot on a logarithmic scale. As predicted by our 
analysis, the running time of our algorithm scales nearly-linearly with the input size n. Moreover, 
the constant in the big-O is quite small: for n = 10 6 , our algorithm takes less than 0.3 seconds, 
which is only three times slower than sorting the samples. Note that this means that no algorithm 
that relies on sorting the samples can be more than 4 times faster than our algorithm when the 
total running time with sorting is taken into account. As before, the running time of our algorithm 
is also essentially independent of the unknown distribution. 
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Appendix 

A Analysis of the General Merging Algorithm: Proof of Theo¬ 
rem 17 

This section is dedicated to the proof of Theorem 17. The proof is a generalization of that of 
Theorem 10. Recall the statement of Theorem 17: 

Theorem 17. Let O p and O c be ij-approximate Ak-projection and Ak~ computation oracles for 
T>. Algorithm General-Merging(/, t, a, e, (5) draws n = 0((adt + log l/5)/e 2 ) samples, runs in 
time O (( R p (n ) + R c (n )) log and outputs a hypothesis h and an interval partition T such that 
\T\ <2 a -t and with probability at least 1 — 5, we have 

\\h — /||i < 3 • OPTx> t H- — -b 2e + r/ . (10) 

a — 1 

Proof. We first bound the running time. The number of iterations of the algorithm is 0(log(n/ at)) 
by the same argument as for histograms, since the number of intervals reduces by a factor of 3/4 in 
each iteration. In each iteration, we compute the closest function in T> and the corresponding Ad+i 
distance, hence the runtime per iteration is bounded by R p (n ) + R c {n), by definition. 

We now prove the error guarantee. Let T = {I\,... ,1?} be the partition of I returned by 
General-Merging, and let h be the function returned. The desired bound on t' is immediate 
since the algorithm terminates only when t! < 2 at. We now prove (10). 

Let h* € T>t be such that || h* — f ||i = OP Tx>,t- Let X* = {1^,..., If} be a partition with 
at most t pieces such that h* € T>j* for all i. Call the end-points of It’s as jumps of h*. For 


[Wal09] 

[Weg70] 

[WN07] 

[WW83] 
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any interval J C I let T(J) be the number of jumps of h* in the interior of J. Since we draw 
n = Q((adt + log l/5)/e 2 ) samples, Corollary 4 implies that with probability at least 1 — 5, 

11/ /ll-4(2a + l)(d+l)t — ^ ‘ 

We condition on this event throughout the analysis. 

We split the total error into three terms based on the final partition X: 

Case 1: Let X be the set of intervals in X with no jumps in h*, i.e., X = { J £ X | T(J) = 0}. 

Case 2a: Let Jq be the set of intervals in X that were created in the initial partitioning step of the 
algorithm and contain a jump of h* , i.e., = {J £ X | T(J) > 0 and J £ Xo}. 

Case 2b: Let J\ be the set of intervals in X that contain at least one jump, and were created by 
merging two other intervals, i.e., J\ = {J £ X | T(J) > 0 and J ^ Xo}. 

Notice that X", Jq, J\ form a partition of /, and thus 


II h ~f ||i = || h - fh,T + || h - /||i,Jo + ||/» - /||l,Ji • 

We bound the error from above in the three cases separately. In particular, we will show: 

]\h - /IIIX <3-||/-/ilxx + 2 • ||/- /IU m , d+1) x + , (14) 

\\h - /Hi,jo < ||/— /IU,j- 0 |. (m+1) ,Jo , (15) 

ii* - fh,j, < 0 e + ii/- f\u d . Wll ,j, + ii/ - fciw, + ■ (“) 

Using these results along with the fact that \\f — h* IliX + 11/ “ h *\\i,Ji < OPT x>,t and a > 2, we 
have 


lift-/111 <3-OPTp ( + 0FTl y + ‘ + 2||/- /|U mM+1) + II/-/|U |Jol „ 


a — 1 


+ 11/ - /lUdj-ji+t,* + ^(l- 77 ! + 

( a ) OPTtt y- H - € 

< 3 • OPT®,* + + 2||/ - /|U 2at(d+1) + V 

(*>) OPT-n, + e 

< 3 • OPT-p t -\ -'-h 2e + rj , 

a — 1 

where (a) follows from Fact 6(d) and since (|X"| + \ Ji \ + |/o|) £ 2at, and (6) follows from the VC 
inequality. Thus, it suffices to prove Equations (14)-(16). 


Case 1. We first consider the set of intervals in X. By the triangle inequality we have 


1/ — f\\i,T < 11/ — h*\\i,jr + \\h — h*||i,jr . 


For any interval J £ X, since h and h* are both in V, they have at most d sign changes, and 
II h - h*\\ lt j = ||h - h*\\ Ad+uJ < ||h - f\\ Ad+1 ,j + ||/- h*\\ Ad+1 ,j. 
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By the definition of h and the projection oracle, 


\\h~ f\\A d+1 ,J < 




f\\A d+1 ,J + 


V 

2 at 


Therefore, 

\\h-h*\\ h j<2.\\h*-f\\ Ad+u j + ^- t . 

Again by the triangle inequality, 

II h* - f\\ Ad+lJ < || h* - f\\ Ad+1 ,j + ||/ - /|U d+1 ,j. 

Summing over the intervals in Z, 

E w h * - f\\A* + uJ < E w h * - + E11/ - 

J&T J&T J&T 

<n^-/iii^ + ii/-7iu m(d+1) ^ 

Combining these, we obtain, 

II h - f\\i,T < 3 • ||/ - h'hs + 2 • ||/ - /1U„ |(d+1) ^ + 2^| , 

which is precisely (14). 

Case 2a. We now analyze the error for the intervals Jq. The set Zq contains only singletons and 
intervals with no sample points. By definition, with probability 1, only the intervals in Zq that 
contain no samples may contain a jump of h*. The singleton intervals containing the sample points 
do not include jumps, and are hence covered by Case 1. Since Jq does not contain any samples, our 
algorithm assigns 

h(J) = f(J) = 0 

for any J 6 Jq. Hence, 

I/ 1, - /Hi ,Jo = H/Hl,^o 5 

and 

ll /7 - - /111 ,Jo = H/lll,Jo 

= £i/0)i 

J 

= £ i/m - 7mi 

J&Jo 

< Wf - f\\A lJMd+1) ,J 0 » 

where the last step simply follows from non-negativity of / — / over Jq. 
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Case 2b. We finally consider J \, the set of intervals created by merging in some iteration of our 
algorithm that also contain jumps. As before, our first step is the following triangle inequality: 


\\h -/||i^< ||/i-^*||i,j 1 + \\h* -/Ik*. 

Consider an interval .7 E J\ with T(J) > 1 jumps of h*. Since h E Vj, h — h* has at most 
d • r(J) sign changes in J. Therefore, 

11* - = lift - 

< II h - f\\A d . r(J)+1 ,J + 11/- /IU d .r(j)+i,J + 11/ “ ^*IU d .r(j)+i,J 

< nmh - f\u d+1 ,j +ii 7- f\u d . nJ)+1 ,j+ ii/ - ^iii,j, (it) 

where (a) follows from Fact 6(a), (6) is the triangle inequality, and inequality (c) uses Fact 6(c) 
along with the fact that T(J) > 1 and d > 1. We start by bounding the Ad+i distance in the first 
term above. 


Lemma 49. For any J G J\, we have 

\\h-f\\ Ad+ 1 ,J< 


OPT©* + e 


+ 


V 


(18) 


(a — l)t 2(cc — 1 )t 

Before proving this lemma, we use it to complete Case 2b. Summing (7) over J G J\ and 


plugging in the lemma, 


\\h-h*\\ hJl < ^E( r ( J )J •(— 

(“) OPT x>,t + e , V 


( OPT© ( + e 


+ 


V 


a — 1 )t 2 (a — 1 )t 


1 + E /IU d . r( j) + i.J + 11/ “ ^*l|l,Ji 

7 J&Ji 


+ 


(a — 1) 2(a-l) 


+ ll/-/lk.* +l ^i + ll/-^lli^i 


where the first term in (a) uses the fact that YljeJi r(<7) — 1 an d H 16 second term uses this in 
conjunction with Fact 6(d). 

We now prove Lemma 49. 


Proof of Lemma 49. Each iteration of our algorithm merges pairs of intervals except those with the 
at largest errors. Therefore, if two intervals were merged, there were at least at other interval pairs 
with larger error. We will use this fact to bound the error on the intervals in J\. 

Suppose an interval .7 G J\ was created in the jth iteration of the while loop of our algorithm, 
i.e., J = U 72 ij for some i G {1,..., Sj/2}. Recall that the intervals 7- J+1 , for 

i G {1,..., Sj/2}, are the candidates for merging at iteration j. Let h! be the distribution given 
by applying the projection oracle to the empirical distribution over each candidate interval T ’-_ ) _ 1 = 
{/} J+1 ,...,/'y 2 ^. +1 }. Note that h'(x) = h(x) for x G J since J remains intact through the 
remainder of the algorithm. 

As with the histogram estimation, for a class V with at most d sign changes, let ed(g,J ) = 
rnin ff / e x)j \\g — g'\\A d+1 - Let C be the set of candidate intervals 7' - +1 in the set Xj +1 with the largest 
a ■ t errors \\h! — /||^ d+1 . By the guarantee of projection oracle, 

W - 7m i+ „7 ij+1 < ej(7,/', + i) + 
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Let £ 0 be the intervals in C that do not contain any jumps of h*. Since h* has at most t jumps, 
|£o| > (a - 1 )t. 

Therefore, 


E w h ' 

~ /IU d+ iT < 

E 

( erf( ^’ /,) + tot) 


I'eCo 


I'ec o 



< 

E 

(ll^ - / Ud+iJ' + 

V \ 
2 at) 



i'ec o 



< 

II/- 

h*\\i,c 0 + \\f-f\\A 



< 

OPT©,* + e + 77 / 2 . 



Since h! is h on the interval J, combining with |£q| > (a — l)t, we obtain 


\\h'-f\\ 


A d +i,J 


\\ h ~ f\\A d+1 ,J < 


OPT©,* + 2e 7/ 

(a — l)t 2 (a — l)t 


completing the proof of the lemma. 


□ 


B Additional Omitted Proofs 


B.l Proof of Fact 26 


We first require the following classical lemma, first proved by Markov [Mar92], For completeness, 
we include an elegant proof by the mathoverflow user fedja 7 . We remark that the bounds in the 
following fact are essentially tight. 

Fact 50 ([Mar92]). Let p(x ) = X^=o c j x ^ be a degree-d polynomial so that |p(.x)| < 1 for all 
x G [—1,1]. Then maxy \cj\ < (\/2 + l) rf for all j = 0,..., d. 

Proof. We first claim that \cj\ < max^gu \p(z)\ where D is the unit complex disc. To see this, we 
notice that by Cauchy’s integral formula, 



1 

27 xi 



p(0 

<57+1 


d( 


where we also changed the order of differentiation and integration and used 


d pjQ = f- -p(0 

dxi C ~ x (C — x)i +1 

7 See http://mathoverflow.net/questions/97769/approximation-theory-reference-for-a-bounded- 

polynomial-having-bounded- coefficie 
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Therefore, we get 


c,- = 


1 

[ 

p( 0 

- ( 

2vr 

j |CI=1 

(j+ 1 

1 

f 

p{ 0 

27T j 

NCl=i 

^■J+l 


d( 


< max \p{z)\ . 
ICI=i 


Consider the function 


F(z) 


= z~ m p 


z + z 


-1 


On the domain {z : \z\ > 1}, this function is analytic. So by the maximum modulus principle, it is 
bounded by its value on the unit circle. Since for all z£D, (z + z~ 1 )/ 2 = K(z), we conclude that 
|^ ? (^)| < maxjgpi^i p(x) < 1 by assumption. Thus we have that 

V—1 ' 


p 


z + z 


< z a 


for all \z\ > 1. Fix any w G D. It is straightforward to see that w = (z + z~ 1 )/ 2 for some z G C\{0}; 
by symmetry of z and z we conclude that this also holds for some z with \z\ > 1. For each w, 
arbitrarily choose such a z and denote it z w . Moreover, for all \z\ > (y/2 + 1), we have 


z + z 


-i 


> 


Z - Z 


-ll 


> ^ + 1 - 7 fc > 1 


and thus we conclude that for all w G D we have that its corresponding z w satishes \z w \ < y/2 + 1 
and therefore \p(w)\ = \p((z w + z“ 1 )/2)| < z£, < (\/2 + l) d , as claimed. □ 

The above statement is for polynomials that are uniformly bounded on [—1,1], We will be 
interested in bounds for polynomials that integrate to a fixed constant. In order to relate these 
bounds, we use the following classical result. 

Fact 51 (Bernstein’s Inequality [Che82]). Let p be a degree-d polynomial and let p' be its derivative. 
Then 

max \p(x)\ < d 2 ■ max \p(x)\ . 

*€[-1,1] *e[-i,i] 

With these results, we are now ready to prove Lemma 26. 

Proof of Lemma 26. Consider the degree-(d + 1) polynomial P such that P( — 1) = 0 and P' = p. 
This implies that P(x) = ff ] p(y) dy. Since p is non-negative on [—1,1], the bound on f\p(y)dy 
then gives 

max \P(x)\ < a ■ (y/2 + \) d . 

*€[-1,1] 

Using Bernstein’s Inequality (Fact 51), we can convert this bound into a bound on P' = p. i.e., we 
get that \p(x)\ < t ■ (d + 1) for all x G [—1,1]. Combining this uniform bound on p with Fact 50 
gives the desired bound on the coefficients of p. □ 
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B.2 Proof of Lemma 34 


Our approach to proving Lemma 34 is relatively straightforward. Assume we had an algorithm A 
that finds the roots of p exactly. Then one could perform a non-negativity test by running A to find 
the roots of //. which correspond to the extrema of p. Given the extrema of p. it suffices to check 
whether p is non-negative at those points and the endpoints of the interval. 

However, such an exact root-finding algorithm A does not exist in general. Nevertheless, there 
are efficient algorithms for finding the approximate roots of p in certain regimes. We leverage these 
results to construct an efficient non-negativity test. Before we proceed, we remark brieffy that 
we could also utilize the univariate SOS algorithm [Sho87, LasOl, Par03], which is arguably more 
elementary than our approach here, but slower. 

Formally, we build on the following result. 

Fact 52 ([PanOl], Part II, Theorem 1.1). Let D denote the complex unit disc. For all v > 0, 
there exists an algorithm FindRoots(c/, (5) satisfying the following guarantee: given any degree-d 
polynomial q(z ) : C —> C with roots z\. .. ,Zd such that Zj £ D for all i and f3 > dlogd, returns 
z\,...,z* d so that | z* — Zj | < 2 2 ~P/ d for all j. Moreover, FindRoots runs in time 0(dlog 2 d ■ 
(log 2 d + log/3)). 

Our polynomials do not necessarily have all roots within the complex unit disc. Moreover, we are 
only interested in real roots. However, it is not too hard to solve our problems with the algorithm 
from Fact 52. We require the following structural result: 

Fact 53 ([Hen 88 ], Sect. 6.4). Let q(x) = x d + Cd~ix d ~ l + ... + c\x + cq be a monic polynomial of 
degree d (i.e., the leading coefficient is 1). Let p{q) denote the norm of the largest zero of q. Then 

p(q) < 2 max \cd-i\ l/l . 
l<i<d 


In order to use the result above, we process our polynomial p so that it becomes monic and still 
has bounded coefficients. We achieve this by removing the leading terms of p with small coefficients. 
This then allows us to divide by the leading coefficient while increasing the other coefficients by a 
controlled amount only. Formally, we require the following definitions. 

Definition 54 (Truncated polynomials). For any degree-d polynomial p = o c iX l an d v > 0 let 


A = A (p,v) 


max 




Ci\ > 


-1 

2d) 


and let n = A u be the operator defined by 


A(p, v) 

(np)(x) = ^2 c i x% - 

i =0 

Formally, n acts on the formal coefficient representation of p as q = 'YfciX 1 . It then returns a 
formal representation Cix\ In a slight abuse of notation, we do not distinguish between the 

formal coefficient representation of p and the polynomial itself. Then Facts 52 and 53 give us the 
following: 
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Lemma 55. There exists an algorithm FastApproxRoots (p,u,p) with the following guarantee. 
Let p be a polynomial as in Definition 34, and, let u, p, > 0 such that u < (where a and d are as 

in Def. 34). Then FastApproxRoots returns approximate roots x \, ■ ■ ■, x *A^p v) £ M so that for 
all real roots y ofYi u p, there is some j so that \y — Xj\ < p. Moreover, FastApproxRoots runs 
in time 0(d log 2 d ■ (log 2 d + log log a + loglog(l/^) + loglog(l/p))). 

Proof. FastApproxRoots (p,u,p) proceeds as follows. We find A = A (p,u) and lip = IRp in 
time 0(d) by a single scan through the coefficients c* of p. Let q\(x) = (lip)(x). Note that the 
roots of q\ are exactly the roots of lip. Then, by Theorem 53, we have that 


. uei .. 

A = 2 max 

i<i<A 


CA-i 


c A 


1/i 

> p(qi) ■ 


The quantity A is also simple to compute in a single scan of the c t . Notice that we have 

2 ad 


(0 

CA—i 

Z max 



CA 


.1 < 


v 

B 


by the definition of A and the assumption that the Cj are bounded by a (Definition 34). Let B 
denote the right hand side of the expression above. If we let q{x) = <71 (Ax), we have that the roots 
of q all he within the complex unit disc. Let z\,... ,za be the roots of lip. Then the roots of q are 
exactly z\/A, ..., za/A. Run FlNDRoOTS(g, 2d + dlogB + dlog(l/p)), which gives us z\, ..., z\ 
so that for all i, we have \z* — Zi/A\ < p/B. Thus, for all i, we have 

I Az* - z\ < Aj^ < p . 

FastApproxRoots( p, V, p) returns the numbers x* = iR(Az*). For any real root x of lip, there 
is some z* so that | Az\ — x\ < p, and thus \x* — x\ < p as well. Thus, we output numbers which 
satisfy the conditions of the Lemma. Moreover, the runtime of the algorithm is dominated by the 
runtime of FindRoots((/, 2d + dlogB + d\og{l/p)), which runs in time 

0(dlog 2 d ■ (log 2 d + log(dlogR + dlog(l/p)))) = 

0(d log 2 d ■ (log 2 d + log log a + log log(l/iv) + log log(l///))) 


This completes the proof. □ 

Proof of Lemma 34- Let v = and let v' = 4Qt ^ +1 ) • Set 

A (p,v) 

r = (II u p)(x) = ^2 CiX 1 . 

i= 1 

We can compute the coefficients of r in time 0(d). Moreover, II(r / (x)) = r'(x). Let xi,... ,Xd >, 
where d' < A, be the roots of r'(x ) in [—1,1], These points are exactly the local extrema of r on 
[—1,1], Our algorithm TestNonneg(p, p) then is simple: 

1. Run FastApproxRoots( r, A, p ) and let x*,..., x^ be its output. 


60 



2. Let J = {i : x* G [—1,1]} and construct the set S' = { — 1,1} U {xi : i G J}. 

3. Denote the points in S by Xo = —1 < x\ < ... < x>/_ i < x>/ = 1, where d' < A + 1. 

4. Evaluate the polynomial p at the points in S' using the fast algorithm from Fact 41. 

5. If at any of these points the polynomial evaluates to a negative number, return that point. 
Otherwise, return “OK”. 

The running time is dominated by the call to FastApproxRoots. By Lemma 55, this algorithm 
runs in time 0(d log 2 d ■ (log 2 d + log log a + loglog(l//x))) as claimed. 

It suffices to prove the correctness of our algorithm. Clearly, if p is nonnegative on [—1,1], it 
will always return “OK”. Suppose there exists a point y G I so that p{y) < — p . 

For a function /, and an interval I = [a,b ], let \ f\oo,l = sup xe / ||/(x)|j. Then, 
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(19) 


where the inequality (a) follows from the choice of A. Thus r(y) < —3^/4. Since the points 
xo,x\, ..., x ( i' are extremal for r on /, there exists a 0 < j < d' so that r{xj) < — 3p/ 4. If j = 0 
(resp. j = 77?/), so if r(—1) < —3p/4 (resp. ?’(1) < —3/r/4), then by Equation (19), we have 
p(— 1) < p/2 (resp. p{ 1) < —p/2). Thus our algorithm correctly detects this, and the polynomial 
fails the non-negativity test as intended. 

Thus assume j € {1,..., A}. By Lemma 55, we know that there is a x* ( so that |x| — Xj\ < v'. 
Since Xj 6 I, either i G J or \xj +1| < v' or \xj — 1| < i/, so in particular, there is a point s G S so 
that |Xj — s\ < v'. Since for all x G [—1,1], we have 


d 

Ip'MI < EI icix l \ < ad{d + 1) 

i= 1 


by the bound on the coefficients of p (see Definition 34). By a first order approximation, we have 
that 


| p(xj) — p(-s)| < ad(d + 1)| Xj — s| < p/4 


where the last inequality follows by the definition of v'. Thus, we have that p(s) < —p/2, and we 
will either return s or some other point in s' G S with p(s') < p(s). Thus our algorithm satishes 
the conditions on the theorem. □ 


C Learning discrete piecewise polynomials 

Throughout this paper we focused on the case that the unknown distribution has a density / 
supported on [—1,1], and that the error metric is the Li-distance with respect to the Lebesgue 
measure on the real line. We now show that our algorithm and analysis naturally generalize to the 
case of discrete distributions. 

def 

In the discrete setting, the unknown distribution is supported on the set [ N] = {1,..., N}, and 
the goal is to minimize the ^-distance between the corresponding probability mass functions. The 
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^i-norm of a function / : [N] —> M is defined to be ||/||i = |/(*)| and the -distance between 

f,9 : [N] -A M is ||/ - g\\i. 

In the following subsections, we argue that our algorithm also applies to the discrete setting 
with only minor adaptations. That is, we can agnostically learn discrete piecewise polynomial 
distributions with the same sample complexity and running time as in the continuous setting. 

C.l Problem statement in the discrete setting 

Fix an interval I C [. N]. We say that a function p : I —> R is a degree-d polynomial if there is a 
degree-d real polynomial q : M —> R such that p(i) = q(i) for all i € I. We say that h : [IV] —> M is 
a t-piecewise degree-d polynomial if there exists a partition of [N] into t intervals so that on each 
interval, h is a degree-d polynomial. Let be the set of t-piecewise degree-d polynomials on [N] 
which are nonnegative at every point in [N]. Fix a distribution (with probability mass function) 

/ : [N] —> M. As in the continuous setting, define OPT dl j c ( = min^-pd isc ||^ — /Hi. • As before, our 
goal is the following: given access to n i.i.d. samples from /, to compute a hypothesis h so that 
probability at least 9/10 over the samples, we have \\h — f\\ l < C ■ OPT dl J c + e , for some universal 
constant C. As before, we let / denote the empirical after taking n samples. 

Our algorithms for the continuous setting also work for discrete distributions, albeit with slight 
modifications. For the case of histogram approximation, the algorithm and its analysis hold verbatim 
for the discrete setting. The only difference is in the definition of flattening; Definition 8 applies to 
continuous functions. For a function / : [N] — > M and an interval J C [n] the flattening of / on J is 
now defined to be the constant function on J which divides the total t\ mass of the function within 
J uniformly among all the points in J. Formally, if J = {a,..., b}, we define the flattening of / on 
J to be the constant function fj(x) = 

C.2 The algorithm in the discrete setting 

Our algorithm in the discrete setting is nearly identical to the algorithm in the continuous setting, 
and the analysis is very similar as well. Here, we only present the high-level ideas of the dis¬ 
crete algorithm and highlight the modifications necessary to move from a continuous to a discrete 
distribution. 


C.2.1 The ^Ifc-norm and general merging in the discrete setting 


We start by noting that the notion of the Afe-norm and the VC inequality also hold in the discrete 
setting. In particular, the A^-norrii of a function / : [N] —> M is defined as 
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where the maximum ranges over all 7),..., 1^ which are disjoint sub-intervals of [N]. 

The basic properties of the Afc-norm (i.e., those in Lemma 6) still hold true. Moreover, it is 
well-known that the VC inequality (Theorem 2) still holds in this setting. These properties of the 
Mfe-norm are the only ones that we use in the analysis of GENERALMERGING. Therefore, it is readily 
verified that the same algorithm is still correct, and has the same guarantees in the discrete setting, 
assuming appropriate approximate ^-projection and M/ c -cornputation oracles for polynomials on 
a fixed subinterval of [. N]. 
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C.2.2 Efficient ^-projection and computation oracles for polynomials 

We argue that, as in the continuous setting, we can give efficient ^.-projection and computation 
oracles for non-negative polynomials of degree d on a discrete interval I, using an O ((l)-dirnensional 
convex program. By appropriately shifting the interval, we may assume without loss of generality 
that the interval is of the form [m] = {1,..., m} for some m < N. 

The Convex Program As in the continuous case, it can be shown that the set of non-negative 
polynomials p on [m] satisfying \\p — /||. 4 fc < r is convex (as in Lemma 21), for any fixed r > 0 (since 
|| • |_ 4 fc is a norm). Moreover, using explicit interpolation formulas for polynomials on [m], it is easy 
to show that every polynomial in this feasible region has a representation with bounded coefficients 
(the analogue of Theorem 27), and that the feasible region is robust to small perturbations in the 
coefficients (the analogue of Theorem 28). Thus, it suffices to give an efficient separation oracle for 
the feasible set. 

The Separation Oracle Recall that the separation oracle in the continuous case consisted of two 
components: (i) a non-negativity checker (Subsection 6.2), and (ii) a fast ^-computation oracle 
(Subsection 6.3). We still use the same approach for the discrete setting. 

To check that a polynomial p : I —> M with bounded coefficients is non-negative on the points in 
/, we proceed as follows: we use Fast-Approx-Roots to find all the real roots of p up to precision 
1/4, then evaluate p on all the points in I which have constant distance to any approximate root of 
p. Since p cannot change sign in an interval without roots, this is guaranteed to find a point in I 
at which p is negative, if one exists. Moreover, since p has at most d roots, we evaluate p at 0(d) 
points; using Fact 41, this can be done in time 0(dlog<iloglogd). 

Finally, to compute the ^-distance between p = o c j x ^ aR d / on an interval I, we use the 
same reduction as in Section 6.3 with minor modifications. The main difference is that between two 
points Xi,Xi+i in the support of the empirical distribution, the quantity p[xi, Xj+i] (see section 6.3) 
is now defined to be 


Xi + l-l 

p[xi,x i+ 1 ] = Y P( £ ) 

£=Xi+l 
Xi+i —1 d 

t=Xi + l j= 0 
d /Si+l-1 \ 

E M • 

1=0 y^=Xj+l ) 

Notice that the above is still a linear expression in the Cj. and there are simple closed-form expres¬ 
sions for ^ or a ^ integers a, (5 and for all (1 < j < d. Following the arguments in Section 

6.3 with this substituted quantity, one can show that the quantity returned by ApproxSepOracle 
in the discrete setting is still a separating hyperplane for p and the current feasible set. Moreover, 
ApproxSepOracle still runs in time 0(d). 
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